[Figure: Data Analysis pipeline: SSDC -> IDM -> EDA -> DDA -> PoC]
Previously:
o Suggested study strategies for learning Statistics
o Presented the role of Statistics in the Scientific Process
o Reviewed basic concepts of Statistics, including:
o Sample Selection & Data Collection
o Initial Data Manipulation
This lecture will:
o Discuss graphical methods of Exploratory Data Analysis
o Present basic methods of Statistical Inference
o Introduce the most commonly used statistical tests
FREQUENCY POLYGONS
Q-Q plots
PIE CHART:
o A circular chart divided into sectors
o Illustrates numerical proportion
o A simple but flawed method
o Unsuitable for large data sets
LINE GRAPH:
o Reveals trends very clearly, e.g., response to interventions over time
[Figure: line graph, "Perfusion of …"]
BAR CHART:
o Useful for presenting DISCRETE data
o Uses horizontal or vertical bars to show Frequency
FREQUENCY PLOT uses rectangles whose:
o widths represent class intervals (BINS), i.e., equal-width groups into which data are put
o heights represent Frequency
HISTOGRAM, a.k.a. Frequency Plot (FP)
A TRUE HISTOGRAM (TH) differs from an FP; the TH uses RECTANGLES whose:
o widths represent class intervals (BINS), like the FP, but
o areas (not heights) are proportional to the corresponding frequencies
o heights equal the Frequency Density of the interval, i.e., the frequency divided by the width of the interval
An FP is identical to a TH only when relative frequencies and bins of equal length are used for the FP.
[Figures: Frequency Plot (y-axis: Frequency) and True Histogram (y-axis: Frequency Density), both over BINS]
FREQUENCY: the number of observations in a given bin
The term "histogram" sounds awkward to physicians: it evokes the word hystera (= uterus), but this is just a coincidence.
HISTOGRAM: "representation of a frequency distribution by means of rectangles, whose widths represent class intervals (BINS) and whose areas are proportional to frequencies."
Data Set: {1,2,2,3,3,3,3,4,4,5,6}
Divide it into BINS and note each frequency: {[1] [2,2] [3,3,3,3] [4,4] [5] [6]}
The assignment pattern here is:
o Bin 1 contains 1: its frequency is 1
o Bin 2 contains 2,2: its frequency is 2
o Bin 3 contains 3,3,3,3: its frequency is 4, etc.
[Figure: histogram of this data set; y-axis: Frequency]
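The binning step above can be sketched in a few lines of Python (a minimal illustration using the slide's own data set and unit-width bins):

```python
from collections import Counter

# Data set from the example above
data = [1, 2, 2, 3, 3, 3, 3, 4, 4, 5, 6]

# With unit-width bins, each bin's frequency is simply the count of that value
freq = Counter(data)

for value in sorted(freq):
    # Print a text "histogram": one '*' per observation in the bin
    print(f"bin {value}: {'*' * freq[value]} (frequency {freq[value]})")
```

Bin 3 comes out with frequency 4, matching the assignment pattern above.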
A HISTOGRAM allows one to analyze a dataset by reducing it to a single graph showing the significance of primary & secondary peaks.
In this example the peaks are not well-defined:
o a taller primary peak & a shorter secondary one
o Median & Mean are hard to localize
o Outliers are hard to define as well
Thus, there is a very poor definition of one signal versus two signals in the data.
Unlike histograms, frequency polygons can be superimposed; they are used to compare the frequency distributions of multiple datasets on one diagram.
Favorite Movies (example data):
o Comedy: 4
o Action: 5
o Horror: 6
o Drama: …
WILKINSON DOT PLOT:
o Used for univariate data
o Simple representation of a distribution
o Useful for highlighting clusters, gaps & outliers in small datasets
CLEVELAND DOT PLOT:
o Used for multivariate data
o Plots points belonging to several categories
o Alternative to bar charts: bars are replaced by dots at the values associated with each category
Two columns separated by a line:
o Right: Leaves, containing the last digit of each number
o Left: Stem, containing all of the other digits
Data set: 44,46,47,49, 63,64,66,68,68, 72,72,75,76, 81,84,88, 106
Useful for spotting outliers and finding the mode
Widely used in the 1980s (easy to make with typewriters); became less common after computers made graphical plots easy
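The stem-and-leaf construction can be sketched directly from the data set above (a minimal illustration; stems are the tens-and-above digits, leaves the last digit):

```python
from collections import defaultdict

# Data set from the slide above
data = [44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106]

# Stem = all digits except the last; Leaf = last digit
stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)

for stem in sorted(stems):
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in stems[stem])}")
```

The gap at stem 5 and the lone value at stem 10 (the outlier 106) are immediately visible, and the mode stands out as a repeated leaf (8 8 on stem 6).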
Displayed as:
[Figure: box plot: Upper Whisker; box from Q1 to Q3, with the Median (Q2) line and a Mean symbol inside; box height = IQR; Lower Whisker]
Conventions used for the Box are uniform:
o Bottom: first quartile Q1
o Line inside: Median Q2
o Symbol inside (e.g. +): Mean
o Top: third quartile Q3
o Height: Interquartile Range (IQR)
The width of the box is arbitrary, as there is no x-axis.
The spacings between the parts of the box help indicate the degree of dispersion (spread) and skewness (asymmetry).
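The box-plot ingredients above (Q1, Q2, Q3, IQR, Mean) can be computed with the Python standard library (a minimal sketch with illustrative values; `method="inclusive"` matches the common linear-interpolation definition of quartiles):

```python
import statistics

# Hypothetical sample (illustrative values only)
data = [2, 4, 4, 5, 6, 7, 8]

# statistics.quantiles with n=4 returns the three cut points [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                   # height of the box
mean = statistics.mean(data)    # plotted as a symbol (e.g. "+") inside the box

print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}, mean={mean:.2f}")
```

Note that different quantile conventions (here, `"inclusive"` vs. `"exclusive"`) give slightly different quartiles, which is why box plots from different software can differ.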
All cases (with few exceptions, i.e., outliers) fit between the upper & lower whiskers.
Example (height in Males vs. Females):
o The distribution is not symmetric: the median line is NOT in the middle of the box
o Spread is similar in Males & Females
Q-Q PLOT:
o x-coordinate: quantile of the 1st (e.g., theoretical) distribution
o y-coordinate: quantile of the 2nd distribution
o QQ plot on the line y=x -> the distributions are similar
o QQ plot on a straight line other than y=x -> the distributions are linearly related
o Commonly used to compare an observed data distribution with a theoretical distribution
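The Q-Q idea can be sketched by pairing matching quantiles of two samples (a minimal illustration with hypothetical data; here `b` is a linear transform of `a`, so the Q-Q points fall exactly on the line y = 2x + 1, i.e., the distributions are linearly related):

```python
import statistics

# Two hypothetical samples; b = 2a + 1, a linear transform of a
a = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
b = [2 * x + 1 for x in a]

# Matching quantiles (deciles here) give the Q-Q points
qa = statistics.quantiles(a, n=10, method="inclusive")
qb = statistics.quantiles(b, n=10, method="inclusive")
points = list(zip(qa, qb))

# Each point satisfies y = 2x + 1: a straight line, but not y = x
for x, y in points:
    print(f"({x:.2f}, {y:.2f})")
```

Plotting these pairs (x from one distribution, y from the other) is exactly what Q-Q plotting software does.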
Statistics reflect acts of interpretation, not absolute facts.
The conclusions of a statistical inference are statistical PROPOSITIONS.
The final Inference is obtained by using the following interrelated propositions:
o ESTIMATION: calculation of an attribute of the sample (statistic) representing the "Best Estimate" of the attribute of the population (parameter)
o CONFIDENCE INTERVAL: a calculated region likely to contain the true values of the attributes of interest; it indicates the reliability of an Estimate
o HYPOTHESIS TESTING: consideration of whether chance is a plausible explanation of the findings; it uses the data to decide if the hypothesis that there is no relationship between measured phenomena (Null Hypothesis) can be rejected
Statistical ASSUMPTIONS: suppositions about mechanisms & features of the Population & Sample; Statistical Inference relies on them.
Statistical MODEL: a set of statistical assumptions.
Assumptions can be divided into:
o Non-Modeling (re: Population, Sample)
o Modeling (re: Distribution, Structure, Cross-Variation)
PARAMETRIC INFERENCE (PI):
o Assumes the existence of an idealized distribution for the population from which the sample is drawn
o Uses the known shape & parameters of that ideal distribution for the Inference
o Its models are simpler than those of NPI, thus it is more convenient
o PI has been the most commonly used approach, but it became the subject of recent criticism*
(*) Nassim Taleb. The Black Swan: The Impact of the Highly Improbable. 2nd Ed, 2010
NONPARAMETRIC INFERENCE (NPI), compared to PI:
o NPI is frequently less convenient, but always more robust
o NPI has less power (a larger sample size is required to draw conclusions with the same degree of confidence)
o NPI can occasionally be simpler than PI
NPI is seen by some as leaving less room for misuse & misunderstanding.
CAVEAT: the term "non-parametric" has additional meanings in Statistics; e.g., it also denotes techniques that do not assume that the structure of a model is fixed.
SAMPLING DISTRIBUTION: a probability distribution that describes the probabilities of the possible values of a specific statistic (it depends on the underlying population distribution).
The Sampling Distribution is necessary for constructing confidence intervals.
SAMPLING DISTRIBUTION OF THE MEAN: a probability distribution that describes the probabilities of the possible values for the Mean.
Example: consider a Normal Population.
[Table residue: possible sample means (e.g., 1.5, 2.5) with their probabilities (0.222 each)]
STANDARD ERROR of a statistic: the Standard Deviation (σ) of the Sampling Distribution of that statistic
STANDARD ERROR OF THE MEAN (SEM): the SD (σ) of the Sampling Distribution of the Mean
SEM vs. SD:
o SEM: an estimate of how far the sample mean is likely to be from the population mean. It is an inferential statistic.
o SD: the degree to which individuals within the sample differ from the sample mean. It is a descriptive statistic.
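The SEM vs. SD distinction can be sketched numerically (a minimal illustration with hypothetical blood-pressure values; SEM is just the sample SD shrunk by the square root of the sample size):

```python
import math
import statistics

# Hypothetical sample of systolic blood pressures (illustrative values only)
sample = [120, 125, 130, 118, 122, 135, 128, 124]

sd = statistics.stdev(sample)         # descriptive: spread of individuals
sem = sd / math.sqrt(len(sample))     # inferential: precision of the sample mean

print(f"SD  = {sd:.2f}")
print(f"SEM = {sem:.2f}")
```

Note that SEM shrinks as n grows (the sample mean becomes a more precise estimate), while SD does not systematically shrink; it estimates a fixed property of the population.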
BIAS of an Estimator: the difference between the expected value of the Estimator & the corresponding population parameter it is designed to estimate.
STANDARD NORMAL DISTRIBUTION: a standardized normal distribution, with an idealized Mean of 0 & Standard Deviation of 1. It allows one to create a compact table for all normal distributions (widely used before computers became available).
Score (raw score, datum): an original datum (observation) that has not been transformed.
Z-Score (Standard Score): the number of standard deviations a score is from the mean of the population
Positive Z-Score: a datum above the mean
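The z-score transformation can be sketched in a few lines (a minimal illustration; the population mean and SD are hypothetical values, e.g., a test scaled to mean 100, SD 15):

```python
from statistics import NormalDist

# Hypothetical population parameters
mu, sigma = 100, 15
score = 130                        # raw score (datum)

z = (score - mu) / sigma           # number of SDs above/below the mean
percentile = NormalDist().cdf(z)   # proportion of the population below this score

print(f"z = {z}, percentile = {percentile:.4f}")
```

`NormalDist().cdf` plays the role of the compact z table mentioned above: one table (or function) serves every normal distribution once scores are standardized.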
t-DISTRIBUTION (Student's t): a probability distribution that is used to estimate normal population parameters when the sample size is small &/or when the population Standard Deviation is unknown.
o Per the Central Limit Theorem (CLT), the sampling distribution of a statistic will follow a normal distribution (ND), as long as the sample size is sufficiently large
o Thus, when we know the standard deviation (SD) of the population, we can compute a z-score and use the ND to evaluate probabilities involving the sample mean
o In reality, sample sizes are sometimes small, and population SDs are unknown
o When either of these occurs, we have to rely on the distribution of the t statistic (also known as the t score)
A Confidence Interval indicates the precision & accuracy of the Estimator.
Margin of Error: reflects Observational Error (Measurement Error), the difference between a measured value of a quantity & its true value.
CONFIDENCE INTERVAL (CI): if repeated samples are taken from the same population, & a CI is calculated for each sample, then a certain percentage (the Confidence Level) of the intervals will include the unknown population parameter.
CIs are usually calculated so that this percentage is 95%.
[Figure: CIs narrow as the sample size n grows: wide for small n, narrow for large n]
CIs are more informative than hypothesis tests: they provide a range of plausible values for the unknown parameter.
CIs are underrated and underused in research.
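A z-based 95% CI for a mean can be sketched as follows (a minimal illustration with hypothetical sample values; for small samples the t distribution should be used instead, as noted earlier):

```python
import math
import statistics
from statistics import NormalDist

# Hypothetical sample (illustrative values only)
sample = [4.2, 5.1, 4.8, 5.0, 4.6, 5.3, 4.9, 5.2, 4.7, 5.0]

mean = statistics.mean(sample)
sem = statistics.stdev(sample) / math.sqrt(len(sample))

# 95% confidence level -> two-sided z critical value (~1.96)
z = NormalDist().inv_cdf(0.975)
ci = (mean - z * sem, mean + z * sem)

print(f"mean = {mean:.3f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Because the interval is mean ± z·SEM, quadrupling n halves the SEM and thus halves the CI width, which is the point of the figure above.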
HYPOTHESIS TESTING (HT): consideration of whether chance is a plausible explanation of the findings. It uses the data to decide if the Null Hypothesis can be rejected.
Null Hypothesis H0: there is no relationship between measured phenomena.
Alternative Hypothesis H1: there IS a relationship between measured phenomena. This is typically the hypothesis of interest to the researcher.
The decision in HT may be correct or may be in error. There are two types of errors, depending on which of the hypotheses is actually true:
o TYPE I Error (TIE): rejecting the Null Hypothesis H0 when it is true. The rate of TIE is called the false positive rate (α).
o TYPE II Error (TIIE): failing to reject H0 when it is false. The rate of TIIE is called the false negative rate (β).
Significance testing involves calculating the probability that a statistic would differ as much or more from the parameter specified in H0 as does the statistic obtained in the experiment.
o One-tailed probability: computed considering differences in only one direction (e.g., the statistic is larger than the parameter)
o Two-tailed probability: computed considering differences in both directions (the statistic either larger or smaller than the parameter)
p value: the probability of obtaining a statistic as different or more different from the parameter specified in H0 as the statistic obtained in the experiment. The p value is computed assuming H0 is true.
Final step: comparison of the p value with the Significance Level α:
o If p < α, then H0 is rejected
o Rejecting H0 is not an all-or-none decision: the lower the p value, the more confidence that H0 is false
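The two-tailed p-value computation above can be sketched for a z statistic (a minimal illustration; the test statistic value is hypothetical):

```python
from statistics import NormalDist

def two_tailed_p(z: float) -> float:
    """Probability, under H0, of a z statistic at least this far
    from 0 in either direction (two-tailed)."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical test statistic from an experiment
z = 2.5
p = two_tailed_p(z)
alpha = 0.05

print(f"p = {p:.4f}; reject H0 at alpha = {alpha}: {p < alpha}")
```

Doubling the one-tailed probability is what makes it two-tailed: differences in both directions count as "at least this extreme."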
STATISTICAL POWER (SP): the probability that the test will reject H0 when H1 is true, i.e., the probability of not committing a Type II Error.
Power of a Test = 1 − β (β is the rate of TIIE)
SP (aka Sensitivity) is a function of the possible distributions, determined by a parameter, under H1. As SP increases, the chances of TIIE decrease. Power analysis can be used to calculate the minimum:
o sample size required to detect an effect of a given size
o effect size to be detected using a given sample size
SP is used to make comparisons between different statistical tests (e.g.: between a parametric and a nonparametric test of the same hypothesis)
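A power calculation can be sketched with the usual normal approximation (a minimal illustration for a two-sided one-sample z test; the effect size, SD, and sample sizes are hypothetical, and the tiny far-tail term is ignored):

```python
import math
from statistics import NormalDist

def power_one_sample_z(delta: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power of a two-sided one-sample z test to detect a true
    mean shift `delta`, given population SD `sigma` and sample size `n`."""
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2)          # critical value for alpha
    se = sigma / math.sqrt(n)                   # SE of the sample mean
    return 1 - nd.cdf(z_crit - abs(delta) / se) # chance the statistic clears z_crit

# Hypothetical scenario: detect a 5-unit shift, SD 15, n = 50
print(f"power = {power_one_sample_z(5, 15, 50):.3f}")
```

Running the function with larger n shows power climbing toward 1, which is exactly the sample-size calculation described above, read in reverse.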
In a research paper the focus is on the Results; Statistical Tests are mentioned only briefly in Materials & Methods.
This brief mention may indicate whether the paper contains valid research:
o Have the tests been properly chosen?
o Have the results been interpreted correctly?
ANALYSIS: Compare means between 2 independent groups
o PARAMETRIC Test: Two-sample t-test
o NONPARAMETRIC Test: Wilcoxon rank-sum test
o Example: Is the mean systolic blood pressure (at baseline) for patients assigned to placebo different from the mean for patients assigned to the Tx group?

ANALYSIS: Compare two measurements taken on the same subjects
o PARAMETRIC Test: Paired t-test
o NONPARAMETRIC Test: Wilcoxon signed-rank test
o Example: Was there a significant change in systolic blood pressure between baseline and 6-month follow-up in the Tx group?

ANALYSIS: Compare means among 3 or more groups
o PARAMETRIC Test: Analysis of variance (ANOVA)
o NONPARAMETRIC Test: Kruskal-Wallis ANOVA
o Example: If our experiment had 3 groups, did the mean systolic blood pressure at baseline differ among them?

ANALYSIS: Estimate the association between two quantitative variables
o PARAMETRIC Test: Pearson coefficient of correlation
o NONPARAMETRIC Test: Spearman's rank correlation

Source: Hoskin T. Parametric and Nonparametric: Demystifying the Terms. Mayo Clinic.
t-TEST
TYPE
o Parametric
APPLICATION
o Evaluates if the means of two groups are statistically different from each other
CORRECT USE
o Especially appropriate for the posttest-only two-group randomized experimental design
ASSUMPTIONS
o The population from which the sample is taken has a normal distribution
o The variances of the populations to be compared are equal
PROCEDURE
o Examines the difference between the means relative to the variability of their scores
FORMULA
o t = signal-to-noise ratio
o Top part: the difference between the two means (the signal)
o Bottom part: a measure of the variability of the scores (the noise)
o The calculated t is determined to be significant or not by using tables
Source: Trochim WV. The Research Methods Knowledge Base, 2nd Edition. 2006
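The signal-to-noise formula above can be sketched as a pooled two-sample t statistic (a minimal illustration; the blood-pressure values are hypothetical, and a real analysis would also compute degrees of freedom and a p value):

```python
import math
import statistics

def pooled_t(a: list, b: list) -> float:
    """Two-sample t statistic assuming equal variances:
    difference of means (signal) / SE of the difference (noise)."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    # Pooled variance: weighted average of the two sample variances
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    se = math.sqrt(sp2 * (1 / na + 1 / nb))              # the "noise"
    return (statistics.mean(a) - statistics.mean(b)) / se  # "signal" / "noise"

# Hypothetical systolic BP values for placebo vs. treatment (illustrative only)
placebo = [128, 131, 125, 130, 129]
treated = [120, 123, 119, 124, 121]
print(f"t = {pooled_t(placebo, treated):.3f}")
```

The pooled variance encodes the equal-variances assumption listed above; when that assumption is doubtful, Welch's unpooled version is used instead.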
TYPE
o Parametric & Non-parametric
APPLICATION
o Evaluates if the means of more than 2 groups are statistically different
ASSUMPTIONS:
o The expected values of the errors are zero o The variances of all errors are equal to each other o The errors are independent from one another & normally distributed
PROCEDURE
o The mean is calculated for each group
o The overall mean is then calculated for all of the groups combined
o Within each group, the total deviation of each individual's score from the group mean is calculated: the "within-group variation"
o Next, the deviation of each group mean from the overall mean is calculated: the "between-group variation"
o Finally, the F statistic is calculated
FORMULA
o F statistic: the ratio of between-group variation to within-group variation
o If the between-group variation is significantly greater than the within-group variation, it is likely that there is a statistically significant difference between the groups. Statistical software determines if the F statistic is significant or not.
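The ANOVA procedure above can be sketched directly (a minimal illustration with hypothetical groups; statistical software would additionally convert F into a p value):

```python
import statistics

def one_way_f(groups: list) -> float:
    """One-way ANOVA F statistic: between-group mean square (MSB)
    over within-group mean square (MSW), per the procedure above."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    means = [statistics.mean(g) for g in groups]             # group means
    grand = statistics.mean(x for g in groups for x in g)    # overall mean

    # Between-group variation: group means vs. the overall mean
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    # Within-group variation: each score vs. its own group mean
    ssw = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    return (ssb / (k - 1)) / (ssw / (n - k))

# Three hypothetical groups (illustrative values only)
print(f"F = {one_way_f([[1, 2, 3], [2, 3, 4], [6, 7, 8]]):.2f}")
```

Here the third group's mean sits far from the overall mean, so the between-group variation dominates and F is large.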
PRESENTATION OF CONCLUSIONS: presentation of the graphical and numerical results & the inferred conclusions from the other steps of Data Analysis, in an accurate and concise form.
Specifically, it involves presentation of an Abstract (in the form of a Poster or an Oral Presentation), followed by preparation & publication of the Manuscript.
[Figure: Data Analysis pipeline: SSDC -> IDM -> EDA -> DDA -> PoC]
As discussed previously:
There is no single best statistical manual; a set of personalized references has to be assembled.
Electronic texts that contain hyperlinks, or that allow instant Web searches for terms, are best suited for studying Statistics.
The following suggestions are examples taken from the large pool of study materials:
Harvey Motulsky: Intuitive Biostatistics: A Nonmathematical Guide to Statistical Thinking, 3rd Ed, 2013
o Non-mathematical Approach to Statistics by a Physician-Statistician
Yosef Dlugacz: Measuring Health Care: Using Quality Data for Operational, Financial, and Clinical Improvement. 1st Ed, 2006
o Deals with application of Statistics in Medical Business
Yasar A. Ozcan: Quantitative Methods in Health Care Management: Techniques and Applications 1st Ed, 2009
o Deals with application of Statistics in Medical Business
Android:
o Statistics Quick Reference by Nuzzed, 2013
iPhone/iPad:
o Learn Statistics by Miaoshuang Dong, 2013
o Statistics Video Lectures by Khan Academy, 2013
Windows 8:
o Statistics Formulas by Hexxa, 2013
o Statistics and Probability by SimpleNEasy, 2013
o Bennette C, Vickers A. Against quantiles: categorization of continuous variables in epidemiologic research, and its discontents. BMC Med Res Methodol. 2012;12:21
o Taleb N. The Black Swan: The Impact of the Highly Improbable, 2nd Ed (with the essay "On Robustness and Fragility"). 2010
Author wishes to thank: Stephen DeCherney, MD, MPH for his valuable comments.
Nothing to disclose: there are no known conflicts of interest associated with this presentation. Specifically, neither the author nor his family has any potential conflict of interest, financial or otherwise, regarding any of the products and/or services discussed here.
POSTTEST-ONLY DESIGN: experimental and control groups are measured and compared after implementation of an intervention.
Comparisons are made only after the intervention, since this design assumes that the two groups are equivalent other than the randomly assigned intervention.
Between-group differences are used to estimate the effect of the intervention.