
Basic Statistics

(BST)
Dr. Pritha Guha
Session: 1,2
Teaching and Grading
Pritha Guha (email: pritha@xlri.ac.in)

Text Book: Bowerman B., O’Connell R., Murphy E., Business Statistics in Practice, 8th ed.,
McGraw Hill Education (India)

We will be using the software R.

Grading:
Mid term (based on session 1-8): 25%
End term (based on all the sessions): 75%, Take home exam

Students are required to bring their laptops to the sessions. For the midterm they must
bring non-programmable scientific calculators; laptops will not be allowed.

2 Basic Statistics, June 2019


R Books: The R Book (Crawley)
R in Action (Kabacoff)



Summarizing data
What is Data?
• Facts and figures collected, analysed
and summarized for presentation and
interpretation.
• All the data collected in a particular
study are referred to as the data set
for the study.



Data Sources
• Existing Sources
• Data Repositories (Kaggle(https://www.kaggle.com/),
UCI (https://archive.ics.uci.edu/ml/index.php))

• Experimental and observational studies


• A beverage company investigates consumer reaction to a new bottle design for one
of its popular soft drinks

• Transactional data, data warehousing and big data


• Disney Parks Case: Improving Visitor Experience
• Amazon Go (https://www.youtube.com/watch?v=NrmMk1Myrxc)



A new bottle design for a popular soft drink: Consumer
reaction
• Gender: Male____, Female_____
• Age: ________
• How many bottles of this soft drink do you consume each month?____



How to do it?

• Find the right data.
• Use the appropriate statistical tools.
• Clearly communicate the numerical information in written language.



Two branches of statistics

• Descriptive Statistics: collecting, organizing, and presenting the data.
• Inferential Statistics: drawing conclusions about a population based on sample data from that population.



Population and Sample



Inferential Statistics
• Estimation
  • E.g.: estimate the population mean weight using the sample mean weight
• Hypothesis testing
  • E.g.: test the claim that the population mean weight is 70 kg

Inference is the process of drawing conclusions or making decisions about a population based on sample results.





Data: Qualitative and Quantitative
• Data can be classified as being Qualitative/Categorical or Quantitative

Qualitative Data
• Labels or names are used to identify an attribute of each element
• Can be either numeric or nonnumeric
• Appropriate statistical analyses are limited

Quantitative Data
• Quantitative data indicate how many (discrete) or how much (continuous)
• Quantitative data are always numeric
• Ordinary arithmetic operations are meaningful for quantitative data



Variables
• A variable is the general characteristic being observed on an object of interest.
• The statistical analysis that is appropriate depends on whether the data for the
variable are qualitative or quantitative.



Exe: Qualitative/Quantitative (pg. 8)
• The net profit for a company in 2015
• The stock exchange on which a company’s stock is traded
• The national debt of the United States in 2015
• The advertisement medium (radio, television, or print) used to promote a product



Scales of Measurement
• Scales of measurement refer to ways in which variables are defined and categorized.
• The scale determines the amount of information contained in the data.
• The scale indicates the data summarization and statistical analysis that are most
appropriate.

Scales of Measurement

Nominal Ordinal Interval Ratio



Scales of Measurement: Nominal
• Data are labels or names used to identify an attribute of the element.
• A nonnumeric label or numeric code may be used.

Examples:
• Gender: Male____, Female_____
• Students in a college are classified by the department in which they are enrolled
using a nonnumeric label such as Architecture and Planning(AP), Management(M),
Law(L), Technology(T) and so on.
• Alternatively, assign a numeric code to the department variable (e.g., 1 denotes Architecture
and Planning, 2 denotes Management, 3 denotes Law, 4 denotes Technology).
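
The numeric coding of a nominal variable can be sketched in R as follows. This is only an illustration: the vector `dept_code` and its six values are made up, and `factor()` is one common way to tell R that the codes are labels rather than quantities.

```r
# Hypothetical department codes for six students (illustrative values only)
dept_code <- c(1, 3, 2, 2, 4, 1)

# Declare the codes as nominal labels rather than numbers
dept <- factor(dept_code,
               levels = 1:4,
               labels = c("AP", "M", "L", "T"))

table(dept)          # counting categories is meaningful
# mean(dept_code)    # would run, but an "average department" has no meaning
```

Counting how many students fall in each department is a valid summary; arithmetic on the codes is not.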



Scales of Measurement: Ordinal
• Data have properties of nominal data and the order or rank of data is meaningful.
• A nonnumeric label or numeric code may be used.
• Example: Likert scale data



Scales of Measurement: Ordinal
• Examples continued…
• Socioeconomic class: Upper, Middle, Working

• Limitation of this measurement scale: differences between categories are
meaningless because the actual numbers used may be arbitrary.
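
A small R sketch of an ordinal variable, assuming four made-up responses. An ordered factor keeps the ranking of the categories, while the underlying integer codes remain arbitrary rank labels.

```r
# Hypothetical responses (illustrative values only)
ses <- factor(c("Middle", "Upper", "Working", "Middle"),
              levels = c("Working", "Middle", "Upper"),
              ordered = TRUE)

ses[1] < ses[2]      # TRUE: Middle ranks below Upper
as.integer(ses)      # 2 3 1 2: the codes are ranks, not measured amounts
```

Comparisons such as "ranks below" are meaningful here, but the gap between codes 1 and 2 need not equal the gap between codes 2 and 3.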



Scales of Measurement: Interval
• The data have properties of ordinal data and interval between observations is
expressed in terms of a fixed unit of measure.
• Always numeric.

Examples:
• Mili has a GATE score of 1205 while Kiran has a GATE score of 1090. Mili scored 115
points more than Kiran.
• The temperature in Jamshedpur today at 8 AM was 22°C; in Shimla it was
19°C.



Scales of Measurement: Interval
• Data may be categorized and ranked with respect to some characteristic or
trait.
• Differences between interval values are equal and meaningful. Thus the
arithmetic operations of addition and subtraction are meaningful.
• No “absolute 0” or starting point defined. Meaningful ratios may not be
obtained.



Scales of Measurement: Interval
• Consider the Fahrenheit scale of temperature.
• This scale is interval because the data are ranked and
differences (+ or −) may be obtained.
• But there is no “absolute 0” (What does 0°F mean?)
• What does the ratio 80°F / 40°F mean?
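
One way to see why the ratio is not meaningful is to convert the same two temperatures to Celsius. The helper `f_to_c` below is written for this sketch only; it applies the standard Fahrenheit-to-Celsius conversion.

```r
# Convert Fahrenheit to Celsius
f_to_c <- function(f) (f - 32) * 5 / 9

80 / 40                      # 2 on the Fahrenheit scale
f_to_c(80) / f_to_c(40)      # 6 on the Celsius scale: the ratio depends on the unit
f_to_c(80) - f_to_c(40)      # the 40 deg F difference is about 22.2 deg C
```

The ratio changes from 2 to 6 when only the unit changes, so "twice as hot" is not a property of the temperatures. The difference, in contrast, converts consistently between the two scales.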



Scales of Measurement: Ratio
• The data will have all the properties of interval data and the ratio of two
values is meaningful.
• Variables such as distance, height, weight use the ratio scale.
Example:
• Age: ________



Scales of Measurement: Ratio
• The strongest level of measurement.
• Ratio data may be categorized and ranked with respect to some
characteristic or trait.
• Differences between ratio values are equal and meaningful.
• There is an “absolute 0” or defined starting point. “0” does mean “the
absence of …” Thus, meaningful ratios may be obtained.



Scales of Measurement
Variables
• Qualitative
  • Numeric: Nominal, Ordinal
  • Non-numeric: Nominal, Ordinal
• Quantitative
  • Numeric: Interval, Ratio



Exe: Nominal/Ordinal (pg. 26)
• Door Choice on a TV Show: Door #1, Door #2, Door #3
• TV show classification: TV-G, TV-PG, TV-14, TV-MA
• Zip Code: 45056, 90015, 127666
• Personal computer operating system: Windows XP, Windows Vista, Windows 7,
Windows 8, Windows 10
• Exchange on which a stock is traded: AMEX, NYSE, NASDAQ, Other
• Question: Are all numbers quantitative variables?



Types of Data
Time Series Data
• Data collected by recording a characteristic of a subject over several time periods.
• Data dealing with the number of building permits issued in one particular locality in Jamshedpur in each of the last 36 months.

Cross-sectional Data
• Data collected by recording a characteristic of many subjects at the same point in time, or without regard to differences in time.
• Data dealing with the number of building permits issued in July 2016 in each of several localities of Jamshedpur.

Panel Data
• Panel data are collected over several time periods from several entities.
• Data dealing with the number of building permits issued in each of several localities of Jamshedpur in each of the last 36 months.



Exe: (pg. 13)
• If we record the total number of cars sold in 2015 by each of 10 car salespeople,
are the data
a) Time series b)Cross-sectional c)Panel Data?

• If we record the total number of cars sold by a particular car salesperson in each
of the years 2011, 2012, 2013, 2014 and 2015, are the data
a) Time series b)Cross-sectional c)Panel Data?



Summarization of
Qualitative Data
Frequency Table and Graphical Representation
A new bottle design for a popular soft drink: Consumer
reaction



A new bottle design for a popular soft drink: Consumer
reaction for 60 consumers
Consumer No.   Q1   Q2   Q3   Q4   Q5
1              6    4    4    2    4
2              7    3    4    3    3
3              4    2    6    4    2
4              3    3    6    6    2
5              4    6    2    5    2
6              6    3    5    5    2


Responses to Q1

6 3 1 7 5 2 5 2 3 7
7 2 5 3 6 7 7 5 7 6
4 7 2 1 6 2 2 3 1 1
3 5 7 4 2 7 4 6 5 7
4 2 5 7 6 4 5 7 3 5
6 4 7 7 6 7 4 6 1 4
Computations Using R
• Download R from https://cran.r-project.org/

Frequency Table
• Frequency Distribution: Groups data into non-overlapping categories and records
how many observations fall into each category.
• Provides insights about the data that cannot be quickly obtained by looking only
at the original data.

Responses   Frequency
1           5
2           8
3           6
4           8
5           9
6           9
7           15
Total       60
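
The frequency table can be reproduced in R from the 60 responses to Q1 listed earlier. This is a minimal sketch; the variable names `q1` and `freq` are chosen here for illustration.

```r
# Responses to Q1 for the 60 consumers, as listed on the slide
q1 <- c(6, 3, 1, 7, 5, 2, 5, 2, 3, 7,
        7, 2, 5, 3, 6, 7, 7, 5, 7, 6,
        4, 7, 2, 1, 6, 2, 2, 3, 1, 1,
        3, 5, 7, 4, 2, 7, 4, 6, 5, 7,
        4, 2, 5, 7, 6, 4, 5, 7, 3, 5,
        6, 4, 7, 7, 6, 7, 4, 6, 1, 4)

freq <- table(q1)   # count of each response value
freq
sum(freq)           # total number of observations
```

`table()` counts how often each distinct value occurs, which is exactly the frequency column above.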
Relative frequency
• The relative frequency of each category equals the proportion
(fraction) of observations in each category.
• It is calculated by dividing the frequency of each category by the
total number of observations.
• The sum of the relative frequencies should be equal to 1 or very close to 1.

Responses   Relative Frequency
1           0.083
2           0.133
3           0.100
4           0.133
5           0.150
6           0.150
7           0.250
Total       1.0



Percentage frequency
• The percentage frequency is the percent (%) of observations in a category.
• It equals the relative frequency of the category multiplied by 100%.

Responses   Percentage Frequency
1           8.333
2           13.333
3           10.0
4           13.333
5           15.0
6           15.0
7           25.0
Total       100.0
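
Relative and percentage frequencies follow directly from the counts. The sketch below starts from the frequencies in the table for Q1; the names `freq`, `rel`, and `pct` are illustrative.

```r
# Frequencies of the Q1 responses, taken from the frequency table
freq <- c("1" = 5, "2" = 8, "3" = 6, "4" = 8, "5" = 9, "6" = 9, "7" = 15)

rel <- freq / sum(freq)      # relative frequency
pct <- 100 * rel             # percentage frequency

round(rel, 3)
round(pct, 3)
sum(rel)                     # should be 1
```

Given a `table()` object, `prop.table()` in base R returns the same relative frequencies.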



Bar Charts
• On one axis (usually the horizontal
axis), we specify the labels that are
used for each of the classes.
• A frequency, relative frequency, or
percent frequency is shown on the
other axis (usually the vertical axis).
• Using a bar of fixed width drawn above
each class label, we extend the height
appropriately.
• The bars are separated to emphasize
the fact that each class is a separate
category.
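
A bar chart of the Q1 frequencies can be drawn with base R's `barplot()`. This is a sketch; the axis labels are wording chosen here, not taken from the slides.

```r
freq <- c("1" = 5, "2" = 8, "3" = 6, "4" = 8, "5" = 9, "6" = 9, "7" = 15)

# Vertical bars: class labels on the horizontal axis, frequency on the vertical
mids <- barplot(freq, xlab = "Response to Q1", ylab = "Frequency")

# horiz = TRUE draws the same chart with horizontal bars
barplot(freq, horiz = TRUE, xlab = "Frequency", ylab = "Response to Q1")
```

`barplot()` leaves gaps between the bars by default, which matches the convention of separating categories.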



Horizontal Bar Plot



Grouped Bar Charts



Stacked Bar Plot



Pie Chart
• First draw a circle.
• Then use the relative frequencies to subdivide the circle into sectors that
correspond to the relative frequency for each class.
• Since there are 360 degrees in a circle, a class with a relative frequency of
0.25 would consume 0.25 × 360 = 90 degrees of the circle.

Complaint                      Count
Pizza not hot                  600
Inadequate topping quantity    105
Inadequate cheese quantity     55
Not baked properly             12
No or less seasoning           75
Delayed delivery               1200
Incorrect billing              57
Wrong size delivered           95
Others                         100



Pie Chart
• First draw a circle.
• Then use the relative frequencies to subdivide the circle into sectors that
correspond to the relative frequency for each class.
• Since there are 360 degrees in a circle, a class with a relative frequency of
0.261 would consume 0.261 × 360 = 93.96 degrees of the circle.

Complaint                      Relative Frequency
Pizza not hot                  0.261
Inadequate topping quantity    0.046
Inadequate cheese quantity     0.024
Not baked properly             0.005
No or less seasoning           0.033
Delayed delivery               0.522
Incorrect billing              0.025
Wrong size delivered           0.041
Others                         0.043
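
The sector angles can be computed in R from the complaint counts. The sketch below uses the counts from the complaint table; the variable names are illustrative.

```r
complaints <- c("Pizza not hot" = 600, "Inadequate topping quantity" = 105,
                "Inadequate cheese quantity" = 55, "Not baked properly" = 12,
                "No or less seasoning" = 75, "Delayed delivery" = 1200,
                "Incorrect billing" = 57, "Wrong size delivered" = 95,
                "Others" = 100)

rel   <- complaints / sum(complaints)   # relative frequencies
angle <- 360 * rel                      # degrees of the circle per class

round(rel, 3)
round(angle, 1)
pie(complaints)                         # base R pie chart of the counts
```

Because the angles here use unrounded relative frequencies, the value for "Pizza not hot" comes out very slightly below the 93.96 degrees obtained from the rounded 0.261.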

Summarization of
Quantitative Data
Frequency Table and Graphical Representation
Frequency Table
• Groups data into intervals called classes and records the number of
observations that fall into each class.
• How to construct?
• Decide the number of non-overlapping classes.
• The classes are exhaustive.
• Determine the width of each class, using equal widths:
approx. class width = (Max − Min) / number of classes
• Determine the class limits: each data point should be in exactly one
class; no more, no less.



Example



Frequency Distribution
• Number of non-overlapping, equal-width classes?
• 2^K rule: use the smallest K for which 2^K exceeds the number of observations n
(here n = 65, so K = 7).

Classes   Frequency
10-13     3
13-16     14
16-19     23
19-22     12
22-25     8
25-28     4
28-31     1
Total     65
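
The class construction can be sketched in R, reading the rule above as the 2^K rule (smallest K with 2^K greater than n). The minimum of 10 and maximum of 29 are taken from the five-number summary of the payment time data given later in these slides; rounding the width up to a whole number is an assumption of this sketch.

```r
n     <- 65          # number of observations in the frequency table
x_min <- 10          # smallest and largest values, from the five-number summary
x_max <- 29

# Smallest K with 2^K > n
K <- 1
while (2^K <= n) K <- K + 1

width  <- ceiling((x_max - x_min) / K)                # round the approximate width up
breaks <- seq(x_min, by = width, length.out = K + 1)  # class boundaries

K
width
breaks
```

With these inputs the boundaries run from 10 to 31 in steps of 3, matching the seven classes in the table.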



Cumulative Frequency Distribution
• Records the number of observations that fall below the
upper limit of each class.

Classes   Frequency   Cumulative Frequency
10-13     3           3
13-16     14          17
16-19     23          40
19-22     12          52
22-25     8           60
25-28     4           64
28-31     1           65
Total     65



Relative and Cumulative Relative Frequency Distribution
• A relative frequency distribution identifies the proportion or fraction of values that
fall into each class.
• A cumulative relative frequency distribution gives the proportion or fraction of
values that fall below the upper limit of each class.



Relative and Cumulative Relative Frequency Distribution

Class   Frequency   Relative Frequency   Cumulative Frequency   Cumulative Relative Frequency
10-13   3           0.046                3                      0.046
13-16   14          0.215                17                     0.262
16-19   23          0.354                40                     0.615
19-22   12          0.185                52                     0.8
22-25   8           0.123                60                     0.923
25-28   4           0.062                64                     0.985
28-31   1           0.015                65                     1.0
Total   65
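
All four columns can be derived from the class frequencies with `cumsum()`. A minimal sketch, with column names chosen here for illustration:

```r
classes <- c("10-13", "13-16", "16-19", "19-22", "22-25", "25-28", "28-31")
freq    <- c(3, 14, 23, 12, 8, 4, 1)

dist <- data.frame(
  class   = classes,
  freq    = freq,
  rel     = round(freq / sum(freq), 3),          # relative frequency
  cum     = cumsum(freq),                        # cumulative frequency
  cum_rel = round(cumsum(freq) / sum(freq), 3)   # cumulative relative frequency
)
dist
```

Computing the cumulative relative frequency from the cumulative counts, rather than by adding rounded relative frequencies, avoids small rounding drift in the last digit.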
Histogram
• Visual representation of a frequency or relative frequency distribution.
• Rectangles represent the classes.
• The base of each rectangle represents the class width.
• The height represents the frequency or relative frequency of the class.
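
A histogram can be drawn with base R's `hist()`. The sketch below uses the 13 values from Problem 3.31 (given later in these slides) only as a convenient small data set; the class boundaries, starting at 120 with width 20, are an arbitrary choice made here.

```r
# Data from Problem 3.31, used here only as a convenient small sample
x <- c(152, 144, 162, 154, 146, 241, 127, 141, 171, 177, 138, 132, 192)

# right = FALSE makes each class include its lower limit but not its upper limit
h <- hist(x, breaks = seq(120, 260, by = 20), right = FALSE,
          main = "Histogram", xlab = "Value")

h$counts   # frequency in each class
```

The object returned by `hist()` holds the class counts, so the same call gives both the plot and the underlying frequency distribution.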

Stem-and-Leaf Displays
• Purpose: to see the overall pattern of the data, by grouping the
data into classes
• the variation from class to class
• the amount of data in each class
• the distribution of the data within each class
• Best for small to moderately sized data distributions



Stem and Leaf Plot: Bank Wait Time
Payment Time Data

22
19
16
18
13



Contingency Tables
• Classifies data on two dimensions
• Rows classify according to one dimension
• Columns classify according to a second dimension
• Requires three variables
• The row variable
• The column variable
• The variable counted in the cells



Contingency Tables: Customer Satisfaction Survey
for different funds
Client Fund Type Level of Satisfaction
1 BOND HIGH
2 STOCK HIGH
3 TAXDEF MED
4 TAXDEF MED
5 STOCK LOW
6 STOCK HIGH
7 STOCK HIGH
8 BOND MED
9 TAXDEF LOW



Contingency Tables: Customer Satisfaction Survey
for different funds
High Low Medium Total

Bond 15 3 12 30
Stock 24 2 4 30
Taxdef 1 15 24 40
Total 40 20 40 100
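
Cross-tabulation in R can be sketched with `table()`. Only the nine client records shown earlier are used here, so the counts are far smaller than in the full 100-client table.

```r
# The nine client records shown on the slide (the full survey has 100 clients)
fund <- c("BOND", "STOCK", "TAXDEF", "TAXDEF", "STOCK",
          "STOCK", "STOCK", "BOND", "TAXDEF")
sat  <- c("HIGH", "HIGH", "MED", "MED", "LOW",
          "HIGH", "HIGH", "MED", "LOW")

tab <- table(fund, sat)   # rows: fund type, columns: satisfaction level
addmargins(tab)           # append row and column totals
```

`addmargins()` adds the "Total" row and column that appear in the contingency table.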



Pareto Chart
• A Pareto chart is a bar plot where the categories are ordered from most to
least frequent, and a line is added to show the cumulative sum.

Type of Defect    Frequency   Percent   Cum. Percent
Crooked Label     78          36.97%    37.0%
Missing Label     45          21.33%    58.3%
Printing Error    33          15.64%    73.9%
Loose Label       23          10.90%    84.8%
Wrinkled Label    14          6.64%     91.5%
Smudged Label     6           2.84%     94.3%
Other             12          5.69%     100.0%
Total             211         100%
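
The percent and cumulative percent columns, and a basic Pareto chart, can be sketched in base R. Keeping "Other" last follows the order used in the table; rescaling the cumulative line to the frequency axis is a plotting choice made for this sketch.

```r
defects <- c("Crooked label" = 78, "Missing label" = 45, "Printing error" = 33,
             "Loose label" = 23, "Wrinkled label" = 14, "Smudged label" = 6,
             "Other" = 12)

pct     <- 100 * defects / sum(defects)   # percent of all defects
cum_pct <- cumsum(pct)                    # running total, in the table's order

round(cbind(Frequency = defects, Percent = pct, Cum.Percent = cum_pct), 2)

# Bars plus a cumulative line rescaled to the frequency axis
mids <- barplot(defects, las = 2, ylim = c(0, sum(defects)))
lines(mids, cum_pct / 100 * sum(defects), type = "b")
```

The first three defect types already account for nearly three quarters of all defects, which is the kind of insight a Pareto chart is meant to surface.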
Pareto Chart



Descriptive Statistics
Numerical Methods
Some Notations



Some Notations



Measures of Central Location
• Mean
• Median
• Quartiles, Percentiles
• Mode



Mean
• Primary measure of central location.
• Sample mean $\bar{x}$:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• Population mean $\mu$:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$$
• Sensitive to outliers



Exe: pg. 142
• Compute the mean :
A) 110, 120, 70, 90, 90, 100, 80, 130, 140
B) 110, 120, 70, 90, 90, 100, 80, 130, 140, 1120
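
The two data sets can be entered in R to see how a single extreme value moves the mean. A minimal sketch; the names `a` and `b` mirror the labels of the exercise.

```r
a <- c(110, 120, 70, 90, 90, 100, 80, 130, 140)
b <- c(a, 1120)      # same data with one extreme value added

mean(a)              # about 103.33
mean(b)              # 205: a single outlier nearly doubles the mean
```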



Median
• When the data are arranged in ascending order, the median is
• the middle value if the number of observations is odd, or
• the average of the two middle values if the number of observations is even.
• The median is another measure of central location that is not affected by
outliers.



Exe: pg. 142
• Compute the median :
A) 110, 120, 70, 90, 90, 100, 80, 130, 140
B) 110, 120, 70, 90, 90, 100, 80, 130, 140, 1120
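
The same comparison for the median, as a short R sketch. The names `a` and `b` mirror the labels of the exercise.

```r
a <- c(110, 120, 70, 90, 90, 100, 80, 130, 140)
b <- c(a, 1120)      # same data with one extreme value added

sort(a)              # 9 values: the median is the 5th
median(a)            # 100
median(b)            # 105: the average of the 5th and 6th of 10 sorted values
```

Adding the extreme value 1120 shifts the median only from 100 to 105, which illustrates its resistance to outliers.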



Percentile
• The pth percentile divides a data set into two parts:
• at least p percent of the observations have values less than or equal to
the pth percentile;
• at least (100 − p) percent of the observations have values
greater than or equal to the pth percentile.



Exe: Problem 3.31 (pg. 160)
• Compute the 90th percentile, the 25th percentile, 75th percentile, 50th
percentile :
A) 152, 144, 162, 154, 146, 241, 127, 141, 171, 177, 138, 132, 192.



Quartiles
• Quartiles are specific percentiles.
• First Quartile (Q1) = 25th Percentile
$$Q_1 = \frac{n + 1}{4}\text{th ranked value}$$
• Second Quartile (Q2) = 50th Percentile = Median
• Third Quartile (Q3) = 75th Percentile
$$Q_3 = \frac{3(n + 1)}{4}\text{th ranked value}$$
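
The ranked-value formulas can be sketched in R using the 13 observations from Problem 3.31. The helper `ranked_value` is hypothetical, and interpolating linearly when the rank is fractional is an assumption of this sketch; the slides do not say how fractional ranks are handled.

```r
x <- sort(c(152, 144, 162, 154, 146, 241, 127, 141, 171, 177, 138, 132, 192))
n <- length(x)

# Value at a (possibly fractional) rank, interpolating between neighbours
ranked_value <- function(x, pos) {
  lo <- floor(pos)
  hi <- ceiling(pos)
  x[lo] + (pos - lo) * (x[hi] - x[lo])
}

q1 <- ranked_value(x, (n + 1) / 4)        # rank 3.5
q3 <- ranked_value(x, 3 * (n + 1) / 4)    # rank 10.5
c(Q1 = q1, Median = median(x), Q3 = q3)
```

R's built-in `quantile()` uses a different default rule (type 7), so its quartiles may differ slightly; `quantile(x, c(0.25, 0.75), type = 6)` uses the same (n + 1)-based rank as the formulas above.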
Mode
• The mode is another measure of central location.
• The most frequently occurring value in a data set
• Used to summarize qualitative data
• A data set can have no mode, one mode (unimodal), or many modes
(multimodal).



Some more measures to consider…
• Besides knowing the central point of a data
set, we would also like to describe the data’s
spread or how far from the centre the data
tend to range.
• Consider our heights!
• Look at marks obtained by two sections of
students.
• Measures of dispersion gauge the variability of
a data set.



Some Measures of Variability/Spread/Dispersion
• Range
• Interquartile Range (IQR)
• Variance
• Standard Deviation



Range

• Range = Maximum Value – Minimum Value


• It is the simplest measure.
• It focuses on the extreme values.
• It is very sensitive to the smallest and largest data values.



Interquartile Range(IQR)
• IQR = Q3 − Q1

• The IQR is the range spanned by the middle half of the data.

84 IBS 2017
Variance
• The variance is a measure of variability that utilizes all the data.
• It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
• The variance is useful in comparing the variability of two or more variables.



Standard Deviation
• The standard deviation of a data set is the positive square
root of the variance.
• It is measured in the same units as the data, making it more
easily interpreted than the variance.



Variance and Standard Deviation (SD)
For a given sample,
• n = number of observations in the sample
• Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
• Sample SD:
$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$

For a given population,
• N = total number of observations
• Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• Population SD:
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$
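
The two variance formulas differ only in the divisor, which the following R sketch makes explicit. It reuses data set A from the mean and median exercises; treating the same nine values as a whole population is done purely for comparison.

```r
x <- c(110, 120, 70, 90, 90, 100, 80, 130, 140)
n <- length(x)

s2     <- sum((x - mean(x))^2) / (n - 1)   # sample variance, same as var(x)
sigma2 <- sum((x - mean(x))^2) / n         # variance if x were the whole population

c(s2 = s2, s = sqrt(s2), sigma2 = sigma2, sigma = sqrt(sigma2))
diff(range(x))                             # range: maximum minus minimum
```

R's `var()` and `sd()` use the sample (n − 1) divisor, so the population versions have to be computed by hand as above.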
Summarizing Grouped Data
• When data are grouped or aggregated, we use these formulas:
• $m_i$ = midpoint of the i-th class
• $f_i$ = frequency of the i-th class
• n = total frequency = $\sum f_i$

Mean:
$$\bar{x} = \frac{\sum f_i m_i}{n}$$
Variance:
$$s^2 = \frac{\sum f_i (m_i - \bar{x})^2}{n - 1}$$
Standard Deviation:
$$s = \sqrt{s^2}$$
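
The grouped-data formulas can be applied to the frequency distribution with classes 10-13 through 28-31 shown earlier. This is a sketch: each class midpoint is taken as the average of its two limits, and the results only approximate what the raw data would give.

```r
lower <- c(10, 13, 16, 19, 22, 25, 28)   # lower class limits
upper <- lower + 3                       # upper class limits
f     <- c(3, 14, 23, 12, 8, 4, 1)       # class frequencies

m    <- (lower + upper) / 2              # class midpoints
n    <- sum(f)                           # total frequency
xbar <- sum(f * m) / n                   # grouped mean
s2   <- sum(f * (m - xbar)^2) / (n - 1)  # grouped variance

c(mean = xbar, variance = s2, sd = sqrt(s2))
```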

Five Number Summary and Box Plot
Five Number Summary

Box Plots
• A box plot allows you to:
• Graphically display the distribution of a data set.
• Compare two or more distributions.
• Identify outliers in a data set.

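[Figure: box plot with the box spanning Q1 to Q3 and Q2 marked inside, whiskers on both sides, and outliers marked with asterisks]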

Outliers

• Draw a straight line that extends from Q1 to the smallest value that is not farther than 1.5 × IQR from Q1.
• Draw a similar straight line that extends from Q3 to the largest data value that is not farther than 1.5 × IQR from Q3.
• Use an asterisk (*) to indicate points that are farther than 1.5 × IQR from the box. These points are considered outliers.

5 Number Summary for Payment Time Data

Min    1st Quartile   Median   3rd Quartile   Max
10.0   15.0           17.0     21.0           29.0
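
The 1.5 × IQR rule for outliers can be checked against this five-number summary. A minimal sketch; the names `five`, `lower`, and `upper` are illustrative.

```r
five <- c(min = 10, q1 = 15, median = 17, q3 = 21, max = 29)

iqr   <- five[["q3"]] - five[["q1"]]    # interquartile range
lower <- five[["q1"]] - 1.5 * iqr       # lower limit for non-outliers
upper <- five[["q3"]] + 1.5 * iqr       # upper limit for non-outliers

c(iqr = iqr, lower = lower, upper = upper)

# TRUE would mean at least one observation lies beyond that limit
c(low_outliers = five[["min"]] < lower, high_outliers = five[["max"]] > upper)
```

Here the limits are 6 and 30, and both the minimum (10) and the maximum (29) lie inside them, so the whiskers would reach the extremes and no points would be flagged. With raw data, base R's `boxplot()` draws the plot directly, though its quartile rule may differ slightly from the one used in these slides.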



5 Number Summary for Exe. 3.35 (pg. 160)

          Min   1st Quartile   Median   3rd Quartile   Max
15 Year   4.5   4.75           4.875    5.125          5.25
30 Year   5.0   5.25           5.375    5.5            5.75



Summarizing Quantitative Data
• Frequency Distribution
• Relative Frequency, Percent Frequency, Cumulative Frequency
Distribution
• Histogram
• Mean
• Median, IQR, Five Number Summary, Box Plot
• Mode
• Variance, Standard Deviation

