John Molson
Department of Supply Chain and Business Technology Management
MBA 643, Fall 2018
Assignment, due Wednesday, October 10, 2018

Question 1

The heights of a group of athletes are modelled by a normal distribution with mean 182 cm and standard deviation 5.0 cm. The weights of this group of athletes are modelled by a normal distribution with mean 90 kg and standard deviation 7.0 kg.

a) Find the probability that a randomly chosen athlete is taller than 188 cm.

P(H > 188) = P(Z > (188 - 182)/5) = P(Z > 1.2) = 0.1151

b) What is the height exceeded by 15% of athletes?

182 + 1.0365 × 5 = 187.18 cm

c) Assuming that, for these athletes, height and weight are independent, find the probability that a randomly selected athlete is taller than 188 cm and weighs more than 97 kg.

P(H > 188 and W > 97) = P(H > 188) × P(W > 97) = P(Z > 1.2) × P(Z > 1) = 0.1151 × 0.1587 = 0.0183

d) If 10 athletes are selected at random for a special competition, find the probability that fewer than two of them are taller than 188 cm.

The number X of selected athletes taller than 188 cm is binomial with n = 10 and p = 0.1151, so
P(X < 2) = P(X = 0) + P(X = 1) = (0.8849)^10 + 10 × (0.1151) × (0.8849)^9 ≈ 0.677

e) If 10 athletes are selected, what is the probability that their average weight is above [value illegible in the scan] kg?

[handwritten working illegible in the scan]

Question 2

An auto insurance company charges younger drivers a higher premium than it does older drivers because younger drivers as a group tend to have more accidents. The company has 3 age groups: Group A includes those under 25 years old, 20% of all its policyholders; Group B includes those 25-39 years old, 45% of all its policyholders; Group C includes those 40 years old and older. Company records show that in any given one-year period, 10% of its Group A policyholders have an accident. The percentages for Groups B and C are 5% and 3%, respectively.

a) What percent of the company's policyholders are expected to have an accident during the next 12 months?

Group C holds the remaining 1 - 0.20 - 0.45 = 0.35 of the policyholders, so
P(Acc) = 0.20 × 0.10 + 0.45 × 0.05 + 0.35 × 0.03 = 0.0200 + 0.0225 + 0.0105 = 0.053, or 5.3%.

b) Suppose Mr. X has just had a car accident. If he is one of the company's policyholders, what is the probability that he is under 25?

P(A | Acc) = (0.20 × 0.10) / 0.053 = 0.3774

c) Say that this company not only classifies drivers by age but, in the case of drivers under 25 years old, also notes whether they have had a driver's education course. One quarter of its policyholders under 25 have had a driver's education course, and 5% of these have an accident in a one-year period. Of those under 25 who have not had a driver's education course, 13% have an accident within a one-year period. A 20-year-old woman takes out a policy with this company and within one year she has an accident. What is the probability that she did not have a driver's education course?

P(no course | Acc) = (0.75 × 0.13) / (0.25 × 0.05 + 0.75 × 0.13) = 0.0975 / 0.1100 = 0.8864

Question 3

Ice cream is an all-time favorite treat. A sample of prices of 10 ice cream makers was obtained: [price data not legible in the scan]

Derive a 95% confidence interval for the average price of an ice cream maker.

Question 4

Attracted by the possible returns from a portfolio of movies, hedge funds have invested in the movie industry by financially backing individual films and/or studios. The hedge fund Star Ventures is currently conducting research on movies involving Adam Sandler, an American actor, screenwriter, and film producer. As a first step, Star Ventures would like to cluster Adam Sandler movies based on their gross box office returns and movie critic ratings.

Using the data in the file Sandler, apply k-means clustering with the variables Rating and Box to characterize three different types of Adam Sandler movies, using 10 random starts (seed 12345) and 10 iterations. Rating corresponds to movie ratings provided by critics (a higher score represents a movie receiving better reviews). Box represents the gross box office earnings in 2015 dollars. Normalize the values of the input variables to adjust for the different magnitudes of the variables. Answer the following questions:

a) Which random start is the best? Why?

We normalize the data and request 3 clusters with 10 random starts (seed 12345).

K-Means Clustering: Fitting Parameters
# Clusters                        3
Start type                        Random Start
# Iterations                      10
Random seed: initial centroids    12345

The following random starts are obtained:

Random Start Summary
Start    Sum of Squares
1        43.84047
2        [partly illegible: ...8706]
3        44.35327
4        56.96349
5        111.6261
6        [illegible]
7        74.120379
8        51.108625
9        60.6281
10       67.285636
[initial centroid coordinates for each start are not legible in the scan]

Start 6 is the best since it has the smallest Sum of Squares.

b) Which cluster is the most homogeneous?

Cluster Summary
Cluster    Size    Average distance within cluster
1          24      0.5723
2          15      0.7441
3          9       0.7110
Total      48      0.6520

Based on the average distance within clusters, we conclude that Cluster 1 is the most homogeneous since it has the smallest average distance.

c) Which pairs of clusters are the most different?
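The selection rule used for the random starts (run k-means several times from different random initial centroids and keep the run with the smallest sum of squares) can be sketched as follows. This is only a minimal illustration, not the procedure of the software used in the assignment: the data points are hypothetical (they are not the Sandler file), and the helper names `zscore`, `sq_dist`, and `kmeans_once` are assumptions introduced here.

```python
import random
from statistics import mean, pstdev


def zscore(column):
    """Normalize one variable to mean 0 and standard deviation 1."""
    m, s = mean(column), pstdev(column)
    return [(v - m) / s for v in column]


def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_once(points, k, iterations, rng):
    """One k-means run from k randomly chosen observations as centroids."""
    centroids = rng.sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        centroids = [
            tuple(mean(dim) for dim in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
    # Sum of squared distances from each point to its nearest centroid.
    sse = sum(min(sq_dist(p, c) for c in centroids) for p in points)
    return sse, centroids


# Hypothetical (Rating, Box) values, NOT the data from the Sandler file.
rating = [5, 10, 15, 20, 30, 35, 40, 45, 60, 65, 70, 80]
box = [10, 30, 20, 40, 150, 170, 160, 200, 30, 50, 40, 60]
points = list(zip(zscore(rating), zscore(box)))

rng = random.Random(12345)  # seed taken from the assignment text
starts = [kmeans_once(points, k=3, iterations=10, rng=rng) for _ in range(10)]
sums_of_squares = [sse for sse, _ in starts]
best = min(range(len(starts)), key=lambda i: sums_of_squares[i])
print(f"best start: {best + 1}, sum of squares: {sums_of_squares[best]:.4f}")
```

Because each run can settle in a different local optimum, comparing the sums of squares across starts is what justifies preferring one start over the others.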
Inter-Cluster Distances
Cluster    1         2         3
1          0         2.0292    2.2087
2          2.0292    0         2.3719
3          2.2087    2.3719    0

Clusters 2 and 3 are the most different because their inter-cluster distance (2.3719) is the largest.

d) What is the strength of each cluster? Interpret.

The strength of a cluster relative to another cluster is the ratio of the inter-cluster distance to the cluster's own average within-cluster distance.

Strength of Cluster 1 compared to Cluster 2: 2.0292 / 0.5723 = 3.5457
On average, an observation in Cluster 1 is 3.5457 times closer to the Cluster 1 centroid than to the Cluster 2 centroid.

Strength of Cluster 1 compared to Cluster 3: 2.2087 / 0.5723 = 3.8593
On average, an observation in Cluster 1 is 3.8593 times closer to the Cluster 1 centroid than to the Cluster 3 centroid.

Strength of Cluster 2 compared to Cluster 1: 2.0292 / 0.7441 = 2.7271
On average, an observation in Cluster 2 is 2.7271 times closer to the Cluster 2 centroid than to the Cluster 1 centroid.

Strength of Cluster 2 compared to Cluster 3: 2.3719 / 0.7441 = 3.1876
On average, an observation in Cluster 2 is 3.1876 times closer to the Cluster 2 centroid than to the Cluster 3 centroid.

Strength of Cluster 3 compared to Cluster 1: 2.2087 / 0.7110 = 3.1065
On average, an observation in Cluster 3 is 3.1065 times closer to the Cluster 3 centroid than to the Cluster 1 centroid.

Strength of Cluster 3 compared to Cluster 2: 2.3719 / 0.7110 = 3.3360
On average, an observation in Cluster 3 is 3.3360 times closer to the Cluster 3 centroid than to the Cluster 2 centroid.

e) How would you characterize the movies in each cluster?
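The strength ratios above are plain divisions, so they can be re-derived in a few lines. The sketch below uses the inter-cluster distances and average within-cluster distances quoted in the text; the dictionary layout and the function name `strength` are assumptions made for illustration.

```python
from itertools import permutations

# Average distance of observations to their own centroid (Cluster Summary).
avg_within = {1: 0.5723, 2: 0.7441, 3: 0.7110}

# Distances between cluster centroids (Inter-Cluster Distances).
between = {
    frozenset({1, 2}): 2.0292,
    frozenset({1, 3}): 2.2087,
    frozenset({2, 3}): 2.3719,
}


def strength(own, other):
    """Inter-cluster distance divided by the own cluster's average distance."""
    return between[frozenset({own, other})] / avg_within[own]


ratios = {(a, b): strength(a, b) for a, b in permutations(avg_within, 2)}
for (a, b), value in ratios.items():
    print(f"Cluster {a} compared to Cluster {b}: {value:.4f}")
```

Note that the ratio is not symmetric: the same inter-cluster distance is divided by a different within-cluster average depending on which cluster is taken as the reference.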
Classify the observations

Combining the original data with the normalized classification, we get:

Cluster  Title                                 Rating        Box
1        Shakes the Clown                      37            0.2
1        Coneheads                             34            35.1
1        Airheads                              21            9.3
1        Mixed Nuts                            [illegible]   [illegible]
1        Bulletproof                           8             32.8
1        Dirty Work                            7             14.7
1        Little Nicky                          2             54.7
1        Joe Dirt                              11            36.5
1        The Animal                            30            77.85
1        Eight Crazy Nights                    12            31.3
1        The Hot Chick                         21            46.5
1        The Master of Disguise                1             53.5
1        Dickie Roberts: Former Child Star     23            29.4
1        Deuce Bigalow: European Gigolo        9             27.4
1        Grandma's Boy                         18            7
1        The Benchwarmers                      11            68.3
1        Strange Wilderness                    2             7.2
1        Jack and Jill                         [illegible]   78.7
1        Zookeeper                             14            85.3
1        Here Comes the Boom                   38            47.1
1        That's My Boy                         20            38.4
1        Blended                               14            46.7
1        Men, Women & Children                 31            0.72
1        Paul Blart: Mall Cop 2                6             43.2
2        The Waterboy                          35            236.43
2        Big Daddy                             40            234
2        Mr. Deeds                             22            167.5
2        Anger Management                      43            175.9
2        50 First Dates                        44            152.7
2        The Longest Yard                      31            193.2
2        Click                                 32            162.6
2        I Now Pronounce You Chuck & Larry     14            138.2
2        Bedtime Stories                       25            122
2        You Don't Mess With the Zohan         38            110.9
2        Paul Blart: Mall Cop                  33            162.7
2        Grown Ups                             10            177.29
2        Just Go with It                       19            109.3
2        Hotel Transylvania                    45            154.2
2        Grown Ups 2                           7             136.9
3        Billy Madison                         46            40
3        Happy Gilmore                         60            59
3        The Wedding Singer                    67            117.5
3        Punch-Drunk Love                      79            23.66
3        Spanglish                             53            34
3        Reign Over Me                         64            22.6
3        The House Bunny                       42            53.4
3        Funny People                          68            57.69
3        Top Five                              88            25.5

Using a pivot table, we find the average Box and Rating for each cluster.

Cluster    Average of Box    Average of Rating    Count
1          36.786            17.083               24
2          162.255           29.200               15
3          50.372            63.000               9
Total      78.543            29.479               48

Cluster 1 is characterized by low-rated movies that generated little revenue at the box office. This cluster represents half of the movies in the data. Cluster 2 represents moderately rated movies with large box office revenue and contains 15 of the 48 observations. Cluster 3 contains highly rated movies that still generate relatively low box office revenue.

Question 5

Apple Inc. tracks online transactions at its iStore and is interested in learning about the purchase patterns of its customers in order to provide recommendations as a customer browses its web site. A sample of the "shopping cart" data resides in the file AppleCart. Use a minimum support of 10% of the total number of transactions and a minimum confidence of 60% to generate a list of association rules.

a) What are the rules satisfying the above conditions, in decreasing order of lift ratio?

Association Rules: Fitting Parameters
Method            Apriori
Min support       200 (10%)
Min confidence    60
# Transactions    2000
# Rules           8

Antecedent (A)           Consequent (C)     Support A   Support C   Support A & C   Confidence %   Lift ratio
[RetinaDisplay,Stand]    [Speakers]         325         552         204             62.7692        2.2742
[Stand,Speakers]         [RetinaDisplay]    255         846         204             80             1.8913
[32GB,Speakers]          [RetinaDisplay]    282         846         209             74.1135        1.7521
[Speakers]               [RetinaDisplay]    552         846         390             70.6522        1.6703
[Case]                   [RetinaDisplay]    429         846         303             70.6294        1.6697
[Stand]                  [RetinaDisplay]    482         846         325             67.4274        1.5940
[Cellular]               [RetinaDisplay]    495         846         330             66.6667        1.5760
[64GB]                   [RetinaDisplay]    361         846         225             62.3269        1.4734

The 8 rules above satisfy the minimum confidence of 60% and the minimum support of 10% (200 transactions).

b) Interpret what the rule with the largest lift ratio is saying about the relationship between the antecedent item set and consequent item set.

Antecedent (A)           Consequent (C)     Support A   Support C   Support A & C   Confidence %   Lift ratio
[RetinaDisplay,Stand]    [Speakers]         325         552         204             62.7692        2.2742

Antecedent: RetinaDisplay, Stand; Consequent: Speakers. If an iPad with retina display is purchased with a stand, then speakers are also purchased.

c) Interpret the confidence of the rule with the largest lift ratio.

The confidence of this rule is 62.77%, which means that 62.77% of the purchases that included RetinaDisplay and Stand also included Speakers.

d) Interpret the lift ratio of the rule with the largest lift ratio.

The lift ratio is 2.274, which means that a customer who has purchased a retina display and a stand is 127.4% more likely than a randomly selected customer to buy speakers.

e) Review the rules obtained and summarize what the rules suggest.
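The confidence and lift figures in the rule table follow directly from the three support counts of each rule. As a hedged check, the sketch below recomputes them, assuming 2000 transactions in total (implied by a minimum support of 200 being 10%); the tuple layout and the names `rules` and `results` are illustrative choices, not output of the software used in the assignment.

```python
N_TRANSACTIONS = 2000  # assumption: 200 transactions correspond to 10%

# (antecedent, consequent, support of A, support of C, support of A and C)
rules = [
    ("RetinaDisplay,Stand", "Speakers", 325, 552, 204),
    ("Stand,Speakers", "RetinaDisplay", 255, 846, 204),
    ("32GB,Speakers", "RetinaDisplay", 282, 846, 209),
    ("Speakers", "RetinaDisplay", 552, 846, 390),
    ("Case", "RetinaDisplay", 429, 846, 303),
    ("Stand", "RetinaDisplay", 482, 846, 325),
    ("Cellular", "RetinaDisplay", 495, 846, 330),
    ("64GB", "RetinaDisplay", 361, 846, 225),
]

results = []
for antecedent, consequent, sup_a, sup_c, sup_ac in rules:
    # Confidence: share of antecedent purchases that also contain C.
    confidence = sup_ac / sup_a
    # Lift: confidence relative to the baseline frequency of C.
    lift = confidence / (sup_c / N_TRANSACTIONS)
    results.append(
        {"rule": f"[{antecedent}] -> [{consequent}]",
         "confidence": 100 * confidence,
         "lift": lift}
    )

for r in results:
    print(f"{r['rule']:<42} {r['confidence']:8.4f} {r['lift']:7.4f}")
```

A lift ratio above 1 means the consequent appears more often alongside the antecedent than it does across all transactions, which is why the rules are ranked by lift rather than by confidence alone.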
The 8 rules have many features in common that allow Apple to focus on these collections. For instance, an iPad with retina display is often purchased along with other accessories, including a stand, speakers, cellular service, and a case. An iPad with retina display is often paired with a memory upgrade. A memory upgrade to 32 GB is commonly associated with speakers and/or retina display. Commonly associated accessories include stand and speakers or case and speakers.

Question 6

The online review service Yelp helps millions of consumers find the goods and services they seek. To help consumers make more-informed choices, Yelp includes over 120 million reviews. Below is a table that contains a sample of 15 reviews for an Italian restaurant. Normalize the terms by using stemming and generate a binary term-document matrix.

a) What are the five most common terms in these reviews? How often does each term appear?

The five most common terms are Food, Pizza, Good, Chicken, and Delicious, appearing 4, 4, 3, 2, and 2 times, respectively.

b) Apply k-means clustering (normalize, using 10 random starts (seed 12345) and 10 iterations) to yield two clusters from the presence/absence term-document matrix using all five of the most common terms from the reviews. How many documents are in each cluster? Give a description of each cluster.

The following data will be used:

[binary term-document matrix (15 reviews by 5 terms) not legible in the scan]

K-Means Clustering: Fitting Parameters
# Clusters                        2
Start type                        Random Start
# Iterations                      10
Random seed: initial centroids    12345

Cluster Summary
Cluster    Size    Average distance within cluster
1          11      1.7896
2          4       1.5065
Total      15      1.7141

There are 11 documents in the first cluster and 4 in the second cluster.

Merging the comments with the classification matrix, we find:

[cluster labels and part of the review list are illegible in the scan; the legible reviews are listed in scan order]
Food is delicious and fresh!
Atmosphere was great the food was fantastic
The chicken was too dry.
The chicken did not taste good
Nice people good food
Fantastico
Amazing pasta and meatballs.
Service takes forever and the place is filthy and loud.
[illegible] staff
Really great pizza!
The service was Good
Best pizza ever!
The stuffed pizza is excellent!
Pizza is delicious

We note that Cluster 2 is about the pizza, while Cluster 1 is about other aspects of the restaurant.
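To make the presence/absence term-document matrix concrete, the sketch below builds one from the reviews that are fully legible in the scan. Several things here are assumptions: the review list is incomplete and its segmentation is a best reading of the scan, so the count for "food" can differ from the figure reported in the assignment; `stem` is a toy suffix stripper standing in for a real stemmer; and the names `tokens`, `matrix`, and `doc_freq` are introduced only for illustration.

```python
import string

# Legible reviews only (assumption: transcription and segmentation of the scan).
reviews = [
    "Food is delicious and fresh!",
    "Atmosphere was great the food was fantastic",
    "The chicken was too dry.",
    "The chicken did not taste good",
    "Nice people good food",
    "Fantastico",
    "Amazing pasta and meatballs.",
    "Service takes forever and the place is filthy and loud.",
    "Really great pizza!",
    "The service was Good",
    "Best pizza ever!",
    "The stuffed pizza is excellent!",
    "Pizza is delicious",
]

SUFFIXES = ("ing", "ed", "es", "s")


def stem(word):
    """Toy stemmer: strip one common suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokens(text):
    """Lowercase, drop punctuation, and return the set of stemmed words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return {stem(w) for w in cleaned.split()}


terms = ["food", "pizza", "good", "chicken", "delicious"]
stemmed_terms = [stem(t) for t in terms]

# Binary matrix: one row per review, one column per term (1 = term present).
matrix = [[int(t in tokens(r)) for t in stemmed_terms] for r in reviews]

# Number of reviews containing each term.
doc_freq = {term: sum(row[j] for row in matrix) for j, term in enumerate(terms)}

for review, row in zip(reviews, matrix):
    print(row, review)
print(doc_freq)
```

The rows of `matrix` are the kind of input a two-cluster k-means run would receive; reviews whose only nonzero entry is the pizza column sit close together, which is consistent with one cluster being about the pizza.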
