
Introduction to Biostatistics

Statistics:

Statistics is a science that deals with the methods of collection, organization, analysis, interpretation
and presentation of information that can be stated numerically.

Biostatistics:

Biostatistics is simply statistics applied to the biological sciences, health and medicine.

Purpose of Statistics:

A sample is drawn from a population, and a statistic calculated from the sample is used to estimate
the corresponding parameter of the population.

Types of Statistics:

There are two main types of statistics:

I) Descriptive Statistics: Summarizes or describes the important characteristics of a known
set of population data.
II) Inferential Statistics: Uses sample data to make inferences (or generalizations) about a
population.
Used to make an inference, on the basis of data, about the (non-)existence of a
relationship between the independent and dependent variables.
This includes generalizing from samples to populations using probabilities, performing
hypothesis testing, determining relationships between variables and making predictions.
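
As a minimal sketch of the two branches (assuming Python with scipy installed; all the data values
below are invented for illustration):

# Minimal sketch: descriptive vs. inferential statistics.
# The measurements below are made-up values for illustration only.
from statistics import mean, stdev
from scipy import stats

group_a = [5.1, 4.9, 5.4, 5.0, 5.2]   # measurements from sample A
group_b = [4.6, 4.8, 4.5, 4.9, 4.7]   # measurements from sample B

# Descriptive: summarize the data we actually have.
print("A: mean", mean(group_a), "sd", stdev(group_a))
print("B: mean", mean(group_b), "sd", stdev(group_b))

# Inferential: use the samples to say something about the
# populations they came from (two-sample t-test).
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print("t =", t_stat, "p =", p_value)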

Sampling:

In statistics, quality assurance, and survey methodology, sampling is concerned with the selection of
a subset of individuals from within a statistical population to estimate characteristics of the whole
population.

Two advantages of sampling are that the cost is lower and data collection is faster than measuring
the entire population.
Successful statistical practice is based on focused problem definition. In sampling, this includes
defining the population from which our sample is drawn. A population can be defined as including all
people or items with the characteristic one wishes to understand. Because there is very rarely enough
time or money to gather information from everyone or everything in a population, the goal becomes
finding a representative sample (or subset) of that population.

Sampling may be random or systematic.

In a simple random sample (SRS) of a given size, all subsets of the sampling frame of that size are
given an equal probability of selection. Furthermore, any given pair of elements has the same chance
of selection as any other such pair (and similarly for triples, and so on). This minimizes bias and
simplifies analysis of results.

SRS can be vulnerable to sampling error because the randomness of the selection may result in a
sample that doesn't reflect the makeup of the population. For instance, a simple random sample of
ten people from a given country will on average produce five men and five women, but any given
trial is likely to overrepresent one sex and underrepresent the other.
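
A simple random sample can be drawn with Python's standard library; here the "population" is just
a list of hypothetical ID numbers:

# Minimal sketch of a simple random sample (SRS).
import random

population = list(range(1, 1001))        # 1000 hypothetical individuals
sample = random.sample(population, k=10) # every subset of size 10 is equally likely
print(sample)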

Systematic sampling (also known as interval sampling) relies on arranging the study population
according to some ordering scheme and then selecting elements at regular intervals through that
ordered list. Systematic sampling involves a random start and then proceeds with the selection of
every kth element from then onwards. In this case, k=(population size/sample size). It is important
that the starting point is not automatically the first in the list, but is instead randomly chosen from
within the first to the kth element in the list. A simple example would be to select every 10th name
from the telephone directory (an 'every 10th' sample, also referred to as 'sampling with a skip of 10').
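
A sketch of the same idea in Python (again using hypothetical ID numbers), with a random start
followed by every kth element:

# Minimal sketch of systematic (interval) sampling.
import random

population = list(range(1, 1001))    # 1000 hypothetical individuals
sample_size = 10
k = len(population) // sample_size   # sampling interval, here k = 100

start = random.randint(0, k - 1)     # random start within the first k elements
sample = population[start::k]        # every kth element from the start
print(sample)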

[Figures: Systematic Sampling and Random Sampling]

Hypothesis:

A hypothesis is a statement or an assumption about relationships between variables. A hypothesis
is a tentative explanation for certain behaviors, phenomena or events that have occurred or will
occur.

If you are going to propose a hypothesis, it’s customary to write a statement. Your statement will
look like this:

“If I…(do this to an independent variable)….then (this will happen to the dependent variable).”

For example:

• If I (decrease the amount of water given to herbs) then (the herbs will increase in size).
• If I (give patients counseling in addition to medication) then (their overall depression scale
will decrease).
• If I (give exams at noon instead of 7) then (student test scores will improve).
• If I (look in this certain location) then (I am more likely to find new species).

A good hypothesis statement should:

• Include an “if” and “then” statement (according to the University of California).
• Include both the independent and dependent variables.
• Be testable by experiment, survey or other scientifically sound technique.
• Be based on information in prior research (either yours or someone else’s).
• Have design criteria (for engineering or programming projects).

Null Hypothesis (Ho): This is the currently accepted value for a parameter, or the currently accepted
relation between two variables. We always set up the null hypothesis so that rejecting it supports
the claim we want to test.

Alternative Hypothesis (HA): This is the claim to be tested. It is also called the research hypothesis
or working hypothesis.

Example: It is believed that a candy machine makes chocolate bars that are on average 5g. A worker
claims that the machine after maintenance no longer makes 5g bars. Write Ho and HA.

Null hypothesis “The candy machine makes chocolate bars that are on average 5g”

Ho : μ = 5g

Alternative hypothesis “The candy machine makes chocolate bars that are on average not 5g”

HA : μ ≠ 5g
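
As an illustration only, this is how the candy-machine hypotheses could be tested with a one-sample
t-test in Python's scipy; the bar weights below are invented:

# Sketch of testing Ho: mu = 5 g against HA: mu != 5 g.
from scipy import stats

weights = [5.2, 5.1, 4.8, 5.3, 5.0, 5.4, 5.2, 5.1]  # grams, hypothetical data

t_stat, p_value = stats.ttest_1samp(weights, popmean=5.0)
print("t =", t_stat, "two-sided p =", p_value)
# If p <= 0.05 we reject Ho; otherwise we fail to reject it.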
Testing Hypothesis:

Hypothesis testing either rejects the null hypothesis or fails to reject it. "Fail to reject" and "accept"
are not the same.

Statistical Significance: If the probability of obtaining a result as extreme as the one obtained,
supposing that the null hypothesis were true, is lower than a pre-specified cut-off probability (for
example, 5%), then the result is said to be statistically significant and the null hypothesis is rejected.

Statistical Inference and Biological Inference: Using statistics we can conclude whether the variables
studied are correlated or not. This is statistical inference, which is not necessarily the same as
biological inference.

Type I error (α): A type I error occurs when the null hypothesis (H0) is true, but is rejected. It is
asserting something that is absent, a false hit. A type I error may be likened to a so-called false
positive (a result that indicates that a given condition is present when it actually is not present).

The type I error rate or significance level is the probability of rejecting the null hypothesis given that
it is true. It is denoted by the Greek letter α (alpha) and is also called the alpha level. Often, the
significance level is set to 0.05 (5%), implying that it is acceptable to have a 5% probability of
incorrectly rejecting the null hypothesis.

Type II error (β): A type II error occurs when the null hypothesis is false, but erroneously fails to be
rejected. It is failing to assert what is present, a miss. A type II error may be compared with a so-
called false negative (where an actual 'hit' was disregarded by the test and seen as a 'miss') in a test
checking for a single condition with a definitive result of true or false. A Type II error is committed
when we fail to believe a true alternative hypothesis.

The rate of the type II error is denoted by the Greek letter β (beta) and is related to the power of a
test (which equals 1 − β).
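
A rough simulation can make the meaning of α concrete: when H0 is true, a test at α = 0.05 should
wrongly reject it about 5% of the time. A sketch in Python (numpy and scipy assumed):

# Simulating the Type I error rate under a true null hypothesis.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, trials, rejections = 0.05, 10_000, 0

for _ in range(trials):
    # Draw a sample from a population where H0 (mu = 0) is true.
    sample = rng.normal(loc=0.0, scale=1.0, size=30)
    _, p = stats.ttest_1samp(sample, popmean=0.0)
    if p <= alpha:
        rejections += 1

print("observed Type I error rate:", rejections / trials)  # close to 0.05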

Tabularized relations between truth/falseness of the null hypothesis and outcomes of the test:

Table of error types

Decision about H0 | Null hypothesis (H0) is True      | Null hypothesis (H0) is False
Reject            | Type I error (False Positive)     | Correct inference (True Positive)
Fail to reject    | Correct inference (True Negative) | Type II error (False Negative)
Significance Level and p-value:

For example: An economist wants to determine whether the monthly energy cost for families has
changed from the previous year, when the mean cost per month was $260. The economist randomly
samples 25 families and records their energy costs for the current year.

Why do we even need hypothesis tests? After all, we took a random sample and our sample mean of
330.6 is different from 260. That is different, right? Unfortunately, the picture is muddied because
we’re looking at a sample rather than the entire population.

Sampling error is the difference between a sample and the entire population. Thanks to sampling
error, it’s entirely possible that while our sample mean is 330.6, the population mean could still be
260. Or, to put it another way, if we repeated the experiment, it’s possible that the second sample
mean could be close to 260. A hypothesis test helps assess the likelihood of this possibility!

The significance level, also denoted as alpha or α, is the probability of rejecting the null hypothesis
when it is true. For example, a significance level of 0.05 indicates a 5% risk of concluding that a
difference exists when there is no actual difference.

These types of definitions can be hard to understand because of their technical nature. A picture
makes the concepts much easier to comprehend!

The significance level determines how far out from the null hypothesis value we draw the cut-off
lines on the graph. To graph a significance level of 0.05, we need to shade the 5% of the distribution
that is furthest away from the null hypothesis.
In such a graph, the two shaded areas are equidistant from the null hypothesis value and each
area has a probability of 0.025, for a total of 0.05. In statistics, we call these shaded areas the critical
region for a two-tailed test. If the population mean is 260, we’d expect to obtain a sample mean that
falls in the critical region 5% of the time. The critical region defines how far away our sample statistic
must be from the null hypothesis value before we can say it is unusual enough to reject the null
hypothesis.
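
A sketch of computing these cut-offs in Python. Note that the standard error of the sample mean is
not given in the text; the value 32.75 below is an assumption, chosen so that the numbers line up
with the quoted P value:

# Two-tailed critical region for the energy-cost example,
# using the normal approximation for simplicity.
from scipy import stats

null_mean = 260
se = 32.75   # assumed standard error of the sample mean (not given in the text)

# Cut-offs leaving 2.5% in each tail (alpha = 0.05, two-tailed).
lower = stats.norm.ppf(0.025, loc=null_mean, scale=se)
upper = stats.norm.ppf(0.975, loc=null_mean, scale=se)
print("critical region: below", lower, "or above", upper)

print(330.6 > upper)   # True: the sample mean is significant at the 0.05 level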

Our sample mean (330.6) falls within the critical region, which indicates it is statistically significant at
the 0.05 level.

We can also see if it is statistically significant using the other common significance level of 0.01.

The two shaded areas each have a probability of 0.005, which adds up to a total probability of 0.01.
This time our sample mean does not fall within the critical region and we fail to reject the null
hypothesis. This comparison shows why you need to choose your significance level before you begin
your study. It protects you from choosing a significance level because it conveniently gives you
significant results!

Thanks to the graph, we were able to determine that our results are statistically significant at the
0.05 level without using a P value. However, when you use the numeric output produced by statistical
software, you’ll need to compare the P value to your significance level to make this determination.
P-values are the probability of obtaining an effect at least as extreme as the one in your sample data,
assuming the truth of the null hypothesis.

This definition of P values, while technically correct, is a bit convoluted. It’s easier to understand with
a graph!

To graph the P value for our example data set, we need to determine the distance between the
sample mean and the null hypothesis value (330.6 - 260 = 70.6). Next, we can graph the probability
of obtaining a sample mean that is at least as extreme in both tails of the distribution (260 +/- 70.6).

For our example, the two shaded areas each have a probability of 0.01556, for a total probability of
0.03112. This probability represents the likelihood of obtaining a sample mean that is at least as
extreme as our sample mean in both tails of the distribution if the population mean is 260. That’s our
P value!
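
The same P value can be reproduced numerically (again using the assumed standard error of 32.75,
which is not given in the text):

# Two-tailed P value via the normal approximation.
from scipy import stats

null_mean, sample_mean, se = 260, 330.6, 32.75
z = (sample_mean - null_mean) / se    # about 2.16
p_value = 2 * stats.norm.sf(abs(z))   # two-tailed
print(p_value)                        # about 0.031, matching the quoted 0.03112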

When a P value is less than or equal to the significance level, you reject the null hypothesis. If we take
the P value for our example and compare it to the common significance levels, it matches the
previous graphical results. The P value of 0.03112 is statistically significant at an alpha level of 0.05,
but not at the 0.01 level.

If we stick to a significance level of 0.05, we can conclude that the average energy cost for the
population is greater than 260.

A common mistake is to interpret the P-value as the probability that the null hypothesis is true.

You can use either P values or confidence intervals to determine whether your results are statistically
significant. If a hypothesis test produces both, these results will agree.

The confidence level is equivalent to 1 – the alpha level. So, if your significance level is 0.05, the
corresponding confidence level is 95%.
Summary for Statistical Tests

Test | When used | Null hypothesis | Reject null hypothesis if…

(Student’s) t-test | Comparing two populations with normally-distributed data | There is no
significant difference between the (means of the) two populations | the calculated t value is greater
than or equal to the critical t value

Mann-Whitney U test | Comparing two populations with a skewed distribution | There is no
significant difference between the (medians of the) two populations | the smallest U value is less
than or equal to the critical U value

Spearman Rank Correlation test | Looking for a correlation between two variables (also whether it is
positive or negative and how strong) | There is no correlation between the two variables | rs is
greater than or equal to the critical value (a positive rs indicates a positive correlation; a negative rs
indicates a negative correlation)

Chi-squared (χ²) test | Does an observed set of data differ from what we might expect from the null
hypothesis? | There is no significant difference between the observed and expected frequencies |
the calculated χ² value is greater than the critical value

Which test to use always depends on the hypothesis. Some hypotheses ask whether the dependent
variable increases or decreases as the independent variable increases. Example: does temperature
affect development? As temperature increases, does the percentage hatch rate of brine shrimp
increase or decrease? That is a correlation, so the Spearman Rank Correlation test is used.

A difference question arises when the independent variable is either absent or present rather than
increasing. Example: does listening to music cause students to have higher or lower results?

Listening to music: 10/20 Not listening to music: 16/20

Here we use a statistical test for difference because we are not looking for a correlation but just for
a difference between two groups.

U/t-test: Difference | Spearman Rank: Correlation | χ²: Comparing expected vs. obtained results
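
For reference, each of the tests in the summary table has a counterpart in Python's scipy.stats; the
data in this sketch are invented for illustration:

# Sketch mapping the summary table to scipy.stats calls.
from scipy import stats

group_a = [12.1, 13.4, 11.8, 12.9, 13.1]
group_b = [10.2, 11.1, 10.8, 9.9, 10.5]

# Difference, normally distributed data: Student's t-test.
print(stats.ttest_ind(group_a, group_b))

# Difference, skewed data: Mann-Whitney U test.
print(stats.mannwhitneyu(group_a, group_b))

# Correlation between two variables: Spearman rank correlation.
temperature = [10, 15, 20, 25, 30]
hatch_rate = [22, 35, 41, 58, 63]
print(stats.spearmanr(temperature, hatch_rate))

# Observed vs. expected frequencies: chi-squared test.
observed = [18, 22, 20, 40]
expected = [25, 25, 25, 25]
print(stats.chisquare(observed, f_exp=expected))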
Graphs:
There are several different types of charts and graphs. The four most common are probably line
graphs, bar graphs and histograms, pie charts, and Cartesian graphs. They are generally used for,
and best for, quite different things.

You would use:

Bar graphs to show numbers that are independent of each other. Example data might include things
like the number of people who preferred each of Chinese takeaways, Indian takeaways and fish and
chips.

A histogram is a specific type of bar chart, where the categories are ranges of numbers. Histograms
therefore show grouped continuous data.

Pie charts to show you how a whole is divided into different parts. You might, for example, want to
show how a budget had been spent on different items in a particular year.

Line graphs show you how numbers have changed over time. They are used when you have data that
are connected, and to show trends, for example, average night time temperature in each month of
the year.

Cartesian graphs have numbers on both axes, which therefore allow you to show how changes in
one thing affect another. These are widely used in mathematics, and particularly in Algebra.

Scatter graphs are drawn for correlations (when the Spearman Rank Correlation test is used).
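
A short matplotlib sketch of these chart types (all data invented for illustration):

# Drawing each chart type described above with matplotlib.
import matplotlib.pyplot as plt

fig, axes = plt.subplots(2, 3, figsize=(12, 7))

# Bar graph: numbers that are independent of each other.
axes[0, 0].bar(["Chinese", "Indian", "Fish & chips"], [24, 17, 31])
axes[0, 0].set_title("Bar graph")

# Histogram: continuous data grouped into ranges.
heights = [160, 162, 165, 168, 170, 171, 173, 175, 178, 180, 182, 185]
axes[0, 1].hist(heights, bins=5)
axes[0, 1].set_title("Histogram")

# Pie chart: how a whole is divided into parts.
axes[0, 2].pie([40, 30, 20, 10], labels=["Rent", "Food", "Travel", "Other"])
axes[0, 2].set_title("Pie chart")

# Line graph: change over time.
axes[1, 0].plot(range(1, 13), [5, 6, 8, 11, 14, 17, 19, 18, 15, 11, 8, 6])
axes[1, 0].set_title("Line graph")

# Scatter graph: correlation between two variables.
axes[1, 1].scatter([10, 15, 20, 25, 30], [22, 35, 41, 58, 63])
axes[1, 1].set_title("Scatter graph")

axes[1, 2].axis("off")
plt.tight_layout()
plt.show()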
