
PRIVATE - Megha Majumder

Chapter 1 - Controlled Experiments

I. The Salk Vaccine Field Trial
A. An experiment that tests effectiveness can be done via COMPARISON. The drug is given to subjects in a treatment group, while other subjects are used as controls (not treated). Then the responses of the two groups are compared. Subjects should be assigned to treatment or control at RANDOM, and the experiment should be run DOUBLE-BLIND: neither the subjects nor the doctors who measure the responses should know who was in the treatment group and who was in the control group.
B. Polio - one of the polio vaccines of the 1950s, developed by Jonas Salk, was tested. When the vaccine came out, it wasn't possible to prove its effectiveness by observation alone, because polio is an epidemic disease whose incidence varied from year to year. A low incidence could have meant that the vaccine was effective OR that the year wasn't an epidemic year.
C. The only way to find out if the vaccine worked was to deliberately leave some children unvaccinated and use them as controls. The NFIP ran a controlled experiment to show the effectiveness of the vaccine.
D. Grades most vulnerable to polio: 1, 2, and 3. The field trial was carried out in selected school districts throughout the country where the risk of polio was high. Two million children were involved: 500,000 were vaccinated, 1,000,000 were deliberately left unvaccinated, and 500,000 refused vaccination.
E. Comparison: only subjects in the treatment group were vaccinated; the controls did not get the vaccine. The responses of the two groups could then be compared to see if the treatment made any difference.
F. Although the sizes of the groups differed, the investigators compared the rates at which children got polio in the two groups - cases per hundred thousand. Looking at rates instead of just numbers adjusts for the difference in sizes.
G. Children whose parents consented would go into the treatment group and get the vaccine; the others would serve as controls. However, it's known that higher-income parents would more likely consent to treatment than lower-income parents. This makes the study biased AGAINST the vaccine, because children of higher-income parents are more vulnerable to polio.
H. Polio is a disease of hygiene, so children from wealthier homes are more likely to get it. Children who grow up in less hygienic surroundings tend to be infected early, and after being infected they generate their own antibodies, which protect them against more severe infection later.
I. If the two groups differ with respect to some factor other than the treatment, the effect of this other factor might be CONFOUNDED (mixed up) with the effect of the treatment. Thus, confounding is a major source of bias.
J. NFIP study design - vaccinate grade 2 children whose parents would consent, leaving children in grades 1 and 3 as controls. Potential bias: polio is contagious, spreading through contact, so the incidence could have been higher in grade 2 than in grades 1 and 3, biasing the study against the vaccine. Also, parental consent was needed only for the treatment group, so the children in the treatment group were more vulnerable to polio than those in the control group (another bias against the vaccine).
K. In the Salk vaccine field trial, the chance of assignment to the treatment group or the control group was 50-50 - it was RANDOMIZED CONTROLLED.
L. PLACEBO - another basic precaution, in which the children in the control group were given an injection of salt dissolved in water. During the experiment, the subjects didn't know whether they were in treatment or control, so their response was to the vaccine, not to the idea of a treatment.
M. DOUBLE-BLINDING - the doctors in the Salk trial could have been affected by knowing whether a child was vaccinated or not when diagnosing polio, so the experiment was double-blinded. The subjects didn't know whether they got the treatment or the placebo, and neither did those who evaluated the responses.
N. The NFIP design was biased against the vaccine: the vaccine looked less effective in the NFIP study than in the randomized Salk trial. Main source of bias = confounding. The control group couldn't be compared to the treatment group, because the control group included children whose parents would not have consented, and the treatment group only had children whose parents consented.
O. The RCDB (randomized controlled double-blind) design reduces bias to a minimum, which is the main reason to use it. Assignment to treatment or control is made at random, so no confounding variable can explain the results aside from the treatment. Suppose the vaccine had no effect: each child has a 50-50 chance to be in treatment or control, so each polio case has a 50-50 chance to turn up in the treatment or the control group. Thus, the number of polio cases in the two groups must be about the same.

II. The Portacaval Shunt


i. Portacaval shunt - surgery that redirects blood flow through a shunt when a patient with cirrhosis is bleeding to death. It's hard to make the shunt, so do the benefits outweigh the risks?
ii. Of the 51 studies conducted to assess the effect of the surgery:
a. 32 had no controls: 75% of these were markedly enthusiastic about the shunt (yes, the benefits outweigh the risks) - aka the poorly designed studies exaggerate the value of the surgery
b. 15 had non-randomized controls: 67% of these were markedly enthusiastic (assignment to treatment or control was not random) - aka the poorly designed studies exaggerate the value of the surgery
c. 4 had randomized controls: 0% of these were markedly enthusiastic - aka the well-designed studies show the surgery to have little or no value.
iii. In a randomized controlled experiment, the people in the control group are like the people in the treatment group. In a non-randomized experiment, the surgeon may choose to operate only on the healthier patients, and patients judged ineligible (such as the sicker ones) end up as controls.
iv. Three-year survival rates show that the subjects selected for surgery in the non-randomized studies were healthier than the controls. There is bias in favor of the surgery in the non-randomized experiments because sicker people were used as controls and healthier people as surgery recipients, so a greater percentage of the surgery group survived.

III. HISTORICAL CONTROLS

i. Historical controls - patients who were treated the old way in the past. The treatment group and the historical control group may differ in important ways besides the treatment.
ii. In a controlled experiment, there's a group of patients eligible for a treatment at the beginning of the study. Some of these are assigned to treatment, others are used as controls. Assignment to treatment or control is done CONTEMPORANEOUSLY, that is, in the same time period.
iii. Portacaval shunt experiment - the poorly controlled trials were done with historical controls or non-randomized controls.
iv. Coronary bypass surgery - a common and expensive operation for coronary artery disease. The badly designed studies were more enthusiastic about the value of the surgery.
v. Three-year survival rates for surgery patients and controls show that the treatment and historical control groups differed: patients selected for surgery were healthier.
vi. The controls in the historically controlled studies have a much lower survival rate because they were less healthy to begin with.
vii. Randomized trials avoid that kind of bias, which is why the design of a study matters.
viii. DES, a drug that was used to prevent spontaneous abortion, looked helpful in studies with historical controls but proved not to help in randomized controlled experiments. In fact, it was later linked to a rare form of cancer in the daughters of women who took it.

1. In the Salk Vaccine Field Trial of 1954, run by the NFIP, two million children from grades 1 through 3 in schools across the USA were involved in the experiment. In total, 500,000 students were vaccinated, 500,000 refused, and 1,000,000 students were deliberately left unvaccinated as controls. The researchers then determined the polio incidence rate after vaccination and compared it to that of the previous year (1953). The 500,000 students who were vaccinated were vaccinated with the consent of their parents, and it was primarily higher-income families who consented to the vaccination of their children. Also, the students who were vaccinated were all second graders. Children from grades 1 and 3 were used as the 1,000,000 controls. The results for the trial were as follows:

NFIP study                  Size       Rate (cases per 100,000 subjects)
Grade 2 (vaccine)           225,000    25
Grades 1 and 3 (control)    725,000    54
Grade 2 (no consent)        125,000    44
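The table reports rates rather than raw counts because the groups differ in size. As a rough sketch of that arithmetic (the function names `implied_cases` and `rate_per_100k` are made up for illustration, and the case counts it prints are only approximations, since the published rates are rounded), the conversion can be written as:

```python
# Sizes and rates (cases per 100,000) copied from the NFIP table above.
nfip = {
    "Grade 2 (vaccine)": (225_000, 25),
    "Grades 1 and 3 (control)": (725_000, 54),
    "Grade 2 (no consent)": (125_000, 44),
}

def implied_cases(size, rate):
    """Approximate number of cases implied by a group size and a rate per 100,000."""
    return size * rate / 100_000

def rate_per_100k(cases, size):
    """Rate per 100,000 subjects: this is what makes unequal groups comparable."""
    return cases / size * 100_000

for group, (size, rate) in nfip.items():
    cases = implied_cases(size, rate)
    print(f"{group}: roughly {cases:.0f} cases among {size:,} children "
          f"= {rate} per 100,000")
```

The control group has far more cases in absolute terms mostly because it is more than three times as large; dividing by group size removes that effect.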

What are the potential ways in which the results of this study could have been biased?
- Children could not be vaccinated without parental consent
- Higher-income parents were more likely to give consent
- Children of higher-income families were more vulnerable to polio
- Infection rate can vary from grade to grade

How could this experiment be made to be less biased? Give 2 examples.

Treatment and control groups should be as similar as possible, except for the treatment. Use randomness rather than human judgment to assign subjects to groups and avoid bias.

IV. SUMMARY
i. Statisticians use the method of comparison. They want to know the effect of a treatment (like the Salk vaccine) on a response (like getting polio). To find out, they compare the responses of a treatment group with those of a control group. Usually, it is hard to judge the effect of a treatment without comparing it to something else.
ii. If the control group is comparable to the treatment group, apart from the treatment, then a difference in the responses of the two groups is likely to be due to the effect of the treatment.
iii. However, if the treatment group is different from the control group with respect to other factors, the effects of these other factors are likely to be CONFOUNDED with the effect of the treatment.
iv. To make sure that the treatment group is like the control group, investigators put subjects into treatment or control at random. This is done in randomized controlled experiments.
v. Whenever possible, the control group is given a placebo, which is neutral but resembles the treatment. The response should be to the treatment itself rather than to the idea of the treatment.
vi. In a double-blind experiment, the subjects do not know whether they are in treatment or in control; neither do those who evaluate the responses. This guards against bias, either in the responses or in the evaluations.
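To see why random assignment makes the two groups comparable, here is a minimal simulation sketch. It assumes a treatment with no effect at all, so each case of disease is equally likely to land in either group. The figure of 200 cases and the function name `split_cases_at_random` are invented for illustration and are not taken from the Salk trial.

```python
import random

def split_cases_at_random(n_cases, seed=0):
    """Assign each case to treatment or control by a fair coin toss.

    This models a treatment with NO effect: the group a child lands in
    does not change whether the child gets sick, so every case ends up
    in treatment or control with a 50-50 chance.
    """
    rng = random.Random(seed)
    in_treatment = sum(rng.random() < 0.5 for _ in range(n_cases))
    return in_treatment, n_cases - in_treatment

# Repeat the (hypothetical) experiment a few times with different seeds.
for seed in range(5):
    treatment, control = split_cases_at_random(n_cases=200, seed=seed)
    print(f"seed {seed}: treatment={treatment}, control={control}")
```

Each run should give two counts that are close to each other, differing only by chance. A large gap between the groups in a real randomized trial is therefore hard to explain without a treatment effect.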

8. Some studies find an association between liver cancer and smoking. However, alcohol consumption is a confounding variable. This means:
(ii) Drinking is associated with smoking, and alcohol causes liver cancer.
A confounding variable is a source of bias: it is a factor, other than the treatment itself, by which the two groups (treatment and control) differ. It is a third variable that is associated with both exposure and disease (Freedman, Statistics). In this case, alcohol is the confounding variable: its effect gets mixed up with the effect of smoking. Smokers are generally known to consume more alcohol than non-smokers, and greater alcohol consumption can lead to liver cancer. Thus, smoking is associated with liver cancer because those who smoke are likely to drink more than non-smokers, and it is the heavier drinking that causes liver cancer.

9a. Does screening save lives? Which numbers in the table prove your point?
Yes, screening does save lives. Out of the 31,000 women in the total treatment group, 39 died from breast cancer, giving a rate of 1.3. Out of the 31,000 women in the control group, 63 died from breast cancer, giving a much higher rate of 2.0 compared to the 1.3 rate in the treatment group (the women offered screening in the HIP trial).
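The two rates quoted above can be checked with a short calculation. This sketch assumes the rates are deaths per 1,000 women, which is the unit that reproduces the 1.3 and 2.0 figures; the notes do not state the unit, and the helper name `deaths_per_1000` is made up here.

```python
def deaths_per_1000(deaths, group_size):
    """Death rate expressed per 1,000 women (assumed unit)."""
    return deaths / group_size * 1000

treatment_rate = deaths_per_1000(39, 31_000)
control_rate = deaths_per_1000(63, 31_000)

print(round(treatment_rate, 1))  # 1.3
print(round(control_rate, 1))    # 2.0
```

Because both groups have 31,000 women, comparing the raw counts (39 versus 63) would lead to the same conclusion here; rates matter most when the group sizes differ.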

Chapter 2 - Observational Studies

I. Introduction
i. Controlled experiment - a study where the investigators decide who will be in the treatment group and who will not. Control = someone who didn't get the treatment.
ii. Observational study - the subjects assign themselves to the different groups, and the investigators just watch what happens. The treatment-control idea is still used, but subjects end up in treatment or control without any influence from the investigators.
iii. To determine via an observational study whether giving up smoking will make you live longer, you have to control for confounding variables like age and sex. Older people have different smoking habits and are more at risk for lung cancer, so you have to compare smokers and non-smokers by age. Also, men are more likely to get heart disease than women, so you have to compare smokers and non-smokers within each sex. EX: compare male smokers age 55-59 and male non-smokers age 55-59.
iv. Association between treatment and outcome is circumstantial evidence for causation.
v. Association does not prove causation. Confounding factors may exist.
vi. Observational studies can be powerful tools but can also be misleading. Ask:
a. Were the control and treatment groups similar?
b. Did the two populations differ in ways other than the treatment?
vii. Technique: compare smaller, more homogeneous groups (e.g., by age, sex)

II. THE CLOFIBRATE TRIAL


i. Coronary Drug Project: randomized controlled double-blind experiment (placebo = lactose) to evaluate heart attack prevention drugs
ii. 8,341 patients followed for five years (5,552 got treatment, 2,789 controls)
iii. Clofibrate: a cholesterol-reduction drug evaluated in the study
iv. Comparing the death rates of 20% (clofibrate) and 21% (placebo) shows that clofibrate did not save lives.
v. Many subjects did not take their medicine (non-adherers).
vi. To control for adherence, compare 15% (clofibrate adherers) to 15% (placebo adherers), not to 21% or 25%.
vii. Conclusions:
a. Clofibrate does not have an effect.
b. Adherers are different from non-adherers. The likely reason why adherers lived longer than non-adherers in both the treatment and control groups is that they were more concerned with their health and took better care of themselves in general, beyond just taking their capsules.

III. MORE EXAMPLES
i. Example 1 - Pellagra: Among many associations between this disease (first observed in the 18th century) and other factors, lack of niacin was found to be the underlying cause. Pellagra = a disease found in European villages that caused disability and death. The households where the disease struck were usually unsanitary and filled with blood-sucking flies. The fly had the same geographical range as pellagra cases, and the season when the fly was most active coincided with the season of the disease. Epidemiologists concluded that the disease was infectious and, like malaria, transmitted by insects. However, the flies were just a marker of poverty. Impoverished people ate corn, which has little niacin. Niacin, the P-P (pellagra-preventive) factor, occurs in meat, milk, eggs, and vegetables, but not much in corn. People who got pellagra were too poor to eat those foods, so they ate corn. The flies only indicated that they lived in poverty.
ii. Example 2 - Cervical Cancer and Circumcision: Human papilloma virus (HPV) was found to be the underlying cause and explained the differences in cancer rates between particular populations seen in the 1950s. Cervical cancer used to be one of the most common cancers among women, but among Jews and Muslims it was rare. Because these populations were largely unaffected, investigators thought that circumcision of males was the protective factor. HOWEVER, it turns out that cervical cancer is sexually transmitted, and HPV is the causal agent.
iii. Example 3 - Ultrasound and Low Birthweight: An observational study compared babies exposed to ultrasound exams with babies who weren't exposed and found an association with low birthweight. Is this evidence that ultrasound causes lower birthweight? No - women get ultrasound exams if they have, or expect to have, problem pregnancies. The confounding factor of problem pregnancy was found to explain the association between ultrasound and low birthweight. Randomized controlled experiments showed that, if anything, ultrasound may be protective.
iv. Example 4 - The Samaritans and Suicide: A confounding factor explained an association between the expansion of a volunteer welfare organization (the Samaritans) and a decrease in the English suicide rate in 1964-1970. The Samaritans did not prevent the suicides, as the investigator thought. To control for confounding variables, towns were matched in pairs on the variables regarded as important (one town in each pair had a branch of the Samaritans, the other did not). However, when another investigator used a bigger sample, he found no effect. The suicide rate was stable in the 70s, even though the Samaritans continued to expand. The decline in suicide rates in the 60s was explained by a shift from coal gas to natural gas, which is less toxic.

IV. SEX BIAS IN GRADUATE ADMISSIONS


i. Observational study on sex bias in admissions at UC Berkeley in 1973:
a. 44% of 8,442 male applicants were admitted
b. 35% of 4,321 female applicants were admitted
ii. Compare admissions to the six largest majors:
a. Major A: less selective, but few women and many men applied
b. Major E: highly selective, but many women and few men applied
iii. Simpson's paradox: relationships between percentages in subgroups can be reversed when the subgroups are combined.
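The reversal described by Simpson's paradox is easier to believe after working through numbers. The sketch below uses HYPOTHETICAL applicant counts invented for illustration (they are not the Berkeley figures): women are admitted at a higher rate than men in each major, yet at a lower rate overall, because most women apply to the more selective major.

```python
# HYPOTHETICAL (applied, admitted) counts, invented for illustration only.
applicants = {
    "A": {"men": (800, 480), "women": (100, 70)},   # less selective major
    "E": {"men": (200, 40), "women": (900, 225)},   # highly selective major
}

def admit_rate(applied, admitted):
    """Percentage of applicants admitted."""
    return 100 * admitted / applied

def overall_rate(sex):
    """Admission percentage for one sex with both majors combined."""
    applied = sum(applicants[major][sex][0] for major in applicants)
    admitted = sum(applicants[major][sex][1] for major in applicants)
    return admit_rate(applied, admitted)

for major, by_sex in applicants.items():
    for sex, (applied, admitted) in by_sex.items():
        print(f"Major {major}, {sex}: {admit_rate(applied, admitted):.1f}% admitted")

for sex in ("men", "women"):
    print(f"Overall, {sex}: {overall_rate(sex):.1f}% admitted")
```

Within each major the comparison favors women (70% vs 60% in A, 25% vs 20% in E), but combined it favors men (52% vs 29.5%). Choice of major acts as a confounder: it is associated with sex and with the chance of admission.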

V. CONFOUNDING
i. Confounding: a difference between the treatment and control groups, other than the treatment, that affects the responses being studied. (Example: Fisher's constitutional hypothesis.)
ii. Confounders must be associated with both:
a. the disease/outcome, and
b. the exposure/treatment.
c. Ex: if there's a gene which increases the risk of lung cancer and ALSO gets people to smoke, it meets both tests for a confounder. This gene would create an association between smoking and lung cancer. (A gene that causes cancer but is unrelated to smoking is not a confounder and is sideways to the argument, because it does not account for the association between smoking and cancer.)
iii. Hidden confounders are a major problem in observational studies.
iv. Examples:
a. NFIP polio vaccine study: family income
b. Portacaval shunt studies: health of patients selected for surgery
c. Coronary bypass surgery studies: health of patients selected for surgery
d. Cervical cancer study: sexual activity

VI. SUMMARY AND OVERVIEW


i. In an observational study, investigators do not assign the subjects to treatment or control. Some of the subjects have the condition whose effects are being studied; this is the treatment group. The other subjects are the controls. For example, in a study on smoking, the smokers form the treatment group and the non-smokers are the controls.
ii. Observational studies can establish association: one thing is linked to another. Association may point to causation: if exposure causes disease, then people who are exposed should be sicker than similar people who are not exposed. But association does not prove causation.
iii. In an observational study, the effects of treatment may be confounded with the effects of factors that got the subjects into control or treatment in the first place. Observational studies can be quite misleading about cause-and-effect relationships because of confounding. A CONFOUNDER is a third variable, associated with exposure and with disease.
iv. When looking at a study, ask the following questions. Was there any control group at all? Were historical controls used, or contemporaneous controls? How were subjects assigned to treatment: through a process under the control of the investigator (controlled experiment) or a process outside the control of the investigator (observational study)? If a controlled experiment, was the assignment made using a chance mechanism (randomized controlled), or did assignment depend on the judgment of the investigator?
v. With observational studies and with non-randomized controlled experiments, try to find out how the subjects came to be in treatment or in control. Are the groups comparable? Different? What factors are confounded with treatment? What adjustments were made to take care of confounding? Were they sensible?
vi. In an observational study, a confounding factor can sometimes be CONTROLLED FOR by comparing smaller groups which are relatively homogeneous with respect to the factor.
vii. Study design is a central issue in applied statistics. The great weakness of observational studies is confounding; randomized experiments minimize this problem.

Chapter 3 - The Histogram

I. INTRODUCTION
i. A histogram is a graph used to summarize data.
ii. The total area under the histogram is 1 (that is, 100%).
iii. The horizontal axis is divided into class intervals.
iv. The area of a rectangle is proportional to the percentage of data values in the class interval.
v. Areas of the blocks represent percentages.

#10a p23 (the refused group is your control here since they aren't examined)
To show that screening reduces the risk from breast cancer, someone wants to compare 1.1 and 1.5. Is this a good comparison? Is it biased against screening? For screening?

No, this is not a good comparison, because the groups can differ in many ways, including potentially confounding variables. Women from a higher socioeconomic class are more likely to accept screening, and those who reject it are more likely to be from a lower socioeconomic class. The two classes also have different incidences of breast cancer. Socioeconomic class, then, is a potential confounding variable: it is associated with the treatment, in the sense that women of different socioeconomic status react differently when they're offered screening, and it is also associated with the outcome, in the sense that women of different socioeconomic classes are affected by breast cancer at significantly different rates. This particular comparison, between 1.1 and 1.5, makes screening look worse, because we are comparing the two subgroups within the treatment group, the examined and the refused. The examined are likely to be from one class, and the refused are likely to be from the other. The rate for the examined is low, likely because of the screening. The rate for the refused is also low in comparison to the control group because it started off low: the women who refused were less likely to get breast cancer in the first place, which follows from the information we were given. The subgroup that agreed to be screened has a higher incidence of breast cancer, and the subgroup that refused has a lower one. This means that, taken together, the two subgroups have a lower overall death rate from the disease (lower than 1.5, that is) than the control group.

#11 p23
Cervical cancer is more common among women who have been exposed to the herpes virus, according to many observational studies. Is it fair to conclude that the virus causes cervical cancer?
It is not fair to conclude that the virus causes cervical cancer, because herpes and cervical cancer are both linked to sexually transmitted infections. Rather, it is more likely that the two are associated with one another (association, not causation), and that a confounding variable lies behind the cervical cancer. For example, women who have many sexual partners are at a higher risk for both cervical cancer and herpes, so a greater number and variety of partners means an increased chance that those women would contract STDs. The confounding factor, then, would be the number of sexual partners the women have, which is the variable that likely leads the women to contract both diseases.

#11 p27
California is evaluating a new program to rehabilitate prisoners before their release; the object is to reduce the recidivism rate - the percentage who will be back in prison within two years of release. The program involves several months of boot camp - military-style basic training with very strict discipline. Admission to the program is voluntary. According to a prison spokesman, "Those who complete boot camp are less likely to return to prison than other inmates."

a. What is the treatment group in the prison spokesman's comparison? The control group?
The treatment group in the prison spokesman's comparison is the prisoners who chose to enter the boot camp and finished it. The control group is the prisoners who didn't volunteer for the program, or who volunteered but didn't complete it.
b. Is the prison spokesman's comparison based on an observational study or a randomized controlled experiment?
The prison spokesman's comparison is based on an observational study, because it was the prisoners who decided whether to volunteer for the camp, and whether to stay in it or leave it. The investigators did not randomly choose who was to be in the boot camp and who wasn't.
c. True or false: the data show that boot camp worked.
This is false, because the data do not show that boot camp worked. The boot camp participants volunteered, so it is not possible to know whether their lower recidivism rate is due to the boot camp or to a confounding factor. For example, it's possible that the people who chose to be in boot camp were the ones who were extremely committed and willing to work hard to live better lives and stay out of prison.

#2 p33
2. In figure 2, were there more families earning between $10,000 and $11,000 or between $15,000 and $16,000? Or were the numbers about the same? Make your best guess.
There were more families between $10,000 and $11,000.

#3 p33 - Histogram
(a) B represents the people who scored between 60 and 80.
(b) 20% scored between 40 and 60.
(c) 70% scored over 60.

#4 p42
In a public health service study, a histogram was plotted showing the number of cigarettes smoked by each subject (male current smokers), as shown below. The density is marked in parentheses. The class intervals include the right endpoint, not the left.
(a) 15%
(b) 30%
(c) 50%
(d) 10%
(e) 3.5%

II. DRAWING A HISTOGRAM

i. A distribution table shows the percentage of data in each class interval.
ii. Choose an endpoint convention - left or right (e.g., include left endpoints in class intervals but exclude right endpoints, or vice versa).
iii. Use the class intervals to mark off the horizontal axis.
iv. To figure out the height of a block over a class interval, divide the percentage by the length of the interval (width x height = percentage).
v. This way, the area of the block equals the percentage of families in the class interval.
vi. Units on the vertical scale = percent per horizontal unit.
vii. The height of the block over the interval $7,000 to $10,000 is 5% per thousand dollars: in other words, in each thousand-dollar interval between $7,000 and $10,000, there are about 5% of the families.
viii. The unit on the horizontal axis is $1,000 of family income, and the vertical axis shows the percentage of families per $1,000 of income.
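The height rule in the steps above (height = percentage / interval length) can be sketched in a few lines. In this illustration only the $7,000 to $10,000 interval comes from the notes (5% per thousand dollars over 3 thousand dollars gives 15%); the other two intervals and their percentages are HYPOTHETICAL, chosen so the table adds up to 100%. The function name `block_heights` is also made up.

```python
def block_heights(endpoints, percents):
    """Height of each histogram block in percent per horizontal unit.

    endpoints: class-interval boundaries, e.g. [0, 7, 10, 25]
    percents:  percentage of cases in each interval, one per interval
    """
    heights = []
    for left, right, pct in zip(endpoints, endpoints[1:], percents):
        heights.append(pct / (right - left))  # percentage / interval length
    return heights

# Income in thousands of dollars. The 7-10 interval (15%) follows the notes;
# the 0-7 and 10-25 intervals are hypothetical filler.
endpoints = [0, 7, 10, 25]
percents = [35, 15, 50]

heights = block_heights(endpoints, percents)
for left, right, h in zip(endpoints, endpoints[1:], heights):
    print(f"${left},000 to ${right},000: {h:.2f}% per thousand dollars")

# Width x height recovers each percentage, so the total area is 100%.
total_area = sum(h * (right - left)
                 for h, left, right in zip(heights, endpoints, endpoints[1:]))
print(f"total area = {total_area:.0f}%")
```

Note that the widest interval holds the most families yet has the shortest block, because its 50% is spread over 15 thousand dollars.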

III. GENERATING A HISTOGRAM FROM DATA

i. Toss a fair coin n = 4 times and count the number of heads.
ii. Repeat this experiment N = 10 times.
iii. Example: 3, 1, 3, 2, 0, 2, 1, 4, 2, 0 heads in the 10 trials gives the histogram below left.
iv. If we repeat the experiment N = 1000 times, we get a histogram such as the one shown below right.
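The coin-tossing procedure above can be reproduced with a short simulation. This is only a sketch: the helper names `heads_in_tosses` and `percent_table` are invented, and the seed is arbitrary, so the N = 1000 percentages will differ from whatever figure the notes originally showed. Since each class interval has width 1, the block heights equal the percentages.

```python
import random
from collections import Counter

def heads_in_tosses(n_tosses, rng):
    """Toss a fair coin n_tosses times and count the heads."""
    return sum(rng.random() < 0.5 for _ in range(n_tosses))

def percent_table(counts, n_tosses=4):
    """Percentage of trials with 0, 1, ..., n_tosses heads."""
    tally = Counter(counts)
    return {k: 100 * tally[k] / len(counts) for k in range(n_tosses + 1)}

# The ten results listed in the notes (N = 10).
example = [3, 1, 3, 2, 0, 2, 1, 4, 2, 0]
print(percent_table(example))
# {0: 20.0, 1: 20.0, 2: 30.0, 3: 20.0, 4: 10.0}

# A simulated run with N = 1000 (arbitrary seed, for repeatability).
rng = random.Random(0)
simulated = [heads_in_tosses(4, rng) for _ in range(1000)]
print(percent_table(simulated))
```

With only 10 trials the histogram is ragged; with 1000 trials it settles into a smoother, roughly symmetric shape centered on 2 heads.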
IV. THE DENSITY SCALE
i. The histogram below shows years of school completed by persons age 25 and older in the U.S. in 1991.
ii. Endpoint convention: left endpoint included, counting years of school completed (e.g., people who dropped out partway through ninth grade are in the 8-9 block).
iii. Units on the vertical axis are percent (of people) per year (of schooling).
iv. Area represents percent. Total area = 100%.
v. Block heights show crowding.
   i. Compare the blocks over 8-9 and 9-12. The 8-9 block is a little taller, so this interval is a little more crowded. The 9-12 block is wider, so it has a larger area, with more people: there's more room in the 9-12 interval because it's longer. It's like the Netherlands (a small country) being more crowded even though the US has more people.
   ii. In a histogram, the height of a block represents crowding: percentage per horizontal unit.
vi. Peaks: 8-9, 12-13, and 16-17 - the peaks show how people tend to stop their schooling at one of the three possible graduations rather than dropping out in between.
vii. 12-13 = all the people with high school degrees. Some may have gone to college but didn't finish the first year. The left endpoint convention means that a value at the right endpoint of an interval is counted in the next block.
viii. With the density scale on the vertical axis, the areas of the blocks come out in percent. The area under the histogram over an interval equals the percentage of cases in that interval. The total area under the histogram is 100%. MAKE SURE IT EQUALS 100% (and not 200%).

IV. VARIABLES
i. A (random) variable is a measurement that depends on the outcome of a (random) event. A variable is a characteristic which changes from person to person in a study.
ii. Quantitative variables have numeric values (age, family size, income).
   i. Continuous variables can assume a continuum of values. Examples include income, temperature, pressure, mass, speed, and age.
   ii. A discrete variable can assume only finitely (or countably) many values. Examples include family size and number of engine cylinders. Values can differ by 0, 1, or 2, but nothing in between is possible.
iii. Qualitative variables are non-numeric. Values are typically descriptive words or phrases.
   i. Examples include: marital status, true or false, employment status, eye color, automobile transmission type.
iv. With discrete variables, center the class intervals at the possible values (so if family size can be 2, 3, or 4, the class interval for 2 is 1.5 to 2.5 and the class interval for 3 is 2.5 to 3.5).
v. In the March Current Population Survey, women are asked how many children they have. Results are shown below for women age 25-39, by educational level.
   i. Is the number of children discrete or continuous?
      i. The number of children is discrete.
   ii. Draw histograms for these data.
      KEY:
      - Black columns = women who are high school graduates.
      - Outlined columns (no fill) = women with college degrees.

[Figure: two histograms of number of children. Horizontal axis: Number of children, 0 to 6. Vertical axis ticks: 0, 25, 50.]

   iii. What do you conclude?
      i. Women who have college degrees have fewer children, so women who are better educated (have more years of schooling) have fewer children. The distribution of number of children is concentrated more toward the left end of the graph for the women with college degrees than for those with high school degrees only, which supports this conclusion.

V. CONTROLLING FOR A VARIABLE

i. Study: 2 groups
ii. Users who took the oral contraceptive pill (the treatment group)
iii. Non-users who don't take the pill (the control group)
iv. Problems: it's observational. Women decided for themselves whether or not to take the pill. One question is the pill's effect on blood pressure. Blood pressure tends to go up with age, and the non-users were on the whole older than the users in the treatment group: 70% of non-users were over 30, compared to 50% of the users. The effect of AGE is confounded with the effect of the pill.
v. To make the full effect of the pill visible, it's necessary to make a separate comparison for each age group, THUS CONTROLLING FOR AGE.
vi. Results suggest that if a woman goes on the pill, her blood pressure will go up around 5 mm, but the proof is incomplete. The drug study is an observational study, not a controlled experiment. There could be some factor other than the pill or age which affects blood pressure (though that is not the explanation here - the pill actually does affect blood pressure).

VI. CROSS TABULATION

i. To make comparisons between age groups or other subgroups, use a cross-tab.

VII. SELECTIVE BREEDING

i. Rats were bred for intelligence: maze-bright rats (rats making only a small number of errors in a maze) were bred with each other, and maze-dull rats were bred together. Seven generations later there was a clear separation in maze scores, which is evidence that some mental abilities are at least in part genetically determined. BUT the maze-bright rats did no better than the maze-dulls on other tests of animal intelligence, which is evidence against the investigator's theory. The brights seemed to be introverts (they couldn't have good relationships with other rats, but were well adjusted to life in the maze). The dulls were the opposite.

p51 #4

The figure below is a histogram showing the distribution of blood pressure for all 14,148 women in the drug study (sec 5). Use the histogram to answer the following questions:
a) Is the percentage of women with blood pressures above 130 mm around 25%, 50%, or 75%?
   Answer: 25%
b) Is the percentage of women with blood pressures between 90 mm and 160 mm around 1%, 50%, or 99%?
   Answer: 99%
c) In which interval are there more women: 135-140 mm or 140-150 mm?
   Answer: 140-150 mm
d) Which interval is more crowded: 135-140 mm or 140-150 mm?
   Answer: 135-140 mm is more crowded.
e) On the interval 125-130 mm, the height of the histogram is about 2.1% per mm. What percentage of women had blood pressures in this class interval?
   Answer: 10.5%, which is the width times the height (that is, the area) of that block: 5 mm x 2.1% per mm.
f) Which interval has more women: 97-98 mm or 102-103 mm?
   Answer: The 102-103 mm interval has more women.
g) Which is the most crowded millimeter of all?
   Answer: A millimeter in the 115-120 mm interval.

SUMMARY

1. A histogram represents percents by area. It consists of a set of blocks. The area of each block represents the percentage of cases in the corresponding class interval.
2. With the density scale, the height of each block equals the percentage of cases in the corresponding class interval, divided by the length of that interval.
3. With the density scale, area comes out in percent, and the total area is 100%. The area under the histogram between two values gives the percentage of cases falling in that interval.
4. A variable is a characteristic of the subjects in a study. It can be either qualitative or quantitative. A quantitative variable is either discrete or continuous.
5. A confounding factor is sometimes controlled for by cross-tabulation.
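
The density-scale rule in summary items 2 and 3 can be checked with a few lines of Python. This is only a sketch: the function names are made up for illustration, and the numbers are the 125-130 mm block from the blood pressure exercise (height about 2.1% per mm).

```python
def block_area(width, height_pct_per_unit):
    """Percentage of cases in a class interval = width x height (density scale)."""
    return width * height_pct_per_unit


def block_height(pct_of_cases, width):
    """Height on the density scale = percentage of cases / length of interval."""
    return pct_of_cases / width


# 125-130 mm block: 5 mm wide, about 2.1% per mm high
pct = block_area(width=130 - 125, height_pct_per_unit=2.1)
print(round(pct, 1))  # 10.5
```

Going the other way, block_height(10.5, 5) recovers the 2.1% per mm height.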

CHAPTER 4 - THE AVERAGE AND STANDARD DEVIATION
I. INTRODUCTION
A. Average - used to find the center, as is the median.
B. Standard deviation - measures spread around the average.
C. Interquartile range - another measure of spread.
II. THE AVERAGE
D. HANES - the Health and Nutrition Examination Survey - gets baseline data about demographic variables (age, education, income), physiological variables (height, weight, blood pressure, cholesterol levels), dietary habits, and prevalence of diseases.
E. The average of a list of numbers equals their sum, divided by how many numbers there are.
F. HANES is cross-sectional: different subjects are compared to each other at one point in time. A longitudinal study follows subjects over time and compares them with themselves at different points in time. Because a cross-sectional study compares different people, a lower average height among older men does not mean that individual men get shorter after a certain age.
G. If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal.

6. Twenty-one people in a room have an average height of 5 ft 6 inches. A 22nd person enters the room. How tall would he have to be to raise the average height by 1 inch?

[(21)(66 inches) + (1)x] / 22 = 67 inches
(21)(66) + x = (22)(67), so x = 1474 - 1386 = 88
Answer: 88 inches, or 7 ft 4 inches.
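
The same arithmetic as a short Python sketch. The function name is hypothetical; the idea is just that the new total must equal the new average times the new head count.

```python
def height_needed(n_people, current_avg, target_avg):
    # Total required after the newcomer arrives, minus the total already in the room.
    return (n_people + 1) * target_avg - n_people * current_avg


x = height_needed(21, 66, 67)      # heights in inches
feet, inches = divmod(x, 12)
print(x, feet, inches)  # 88 7 4
```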

#4 p 70
Each of the following lists has an average of 50. For each one, guess whether the SD is around 1, 2, or 10.
a) 1 (all numbers deviate from the average by +1 or -1)
b) 2
c) 2
d) 2
e) 10

#13 p 24
13. A hypothetical university has two departments, A and B. There are 2000 male applicants, of whom half apply to each department. There are 1100 female applicants: 100 apply to Dept A and 1000 to Dept B. Dept A admits 60% of the men who apply and 60% of the women. Dept B admits 30% of the men and 30% of the women. "For each dept, the % of men admitted equals the % of women admitted: this must be so for both departments together." T or F, explain.
This is false.
Dept A accepts 60% of 1000 men, which is 600 men.
Dept B accepts 30% of 1000 men, which is 300 men.
Dept A accepts 60% of 100 women, which is 60 women.
Dept B accepts 30% of 1000 women, which is 300 women.
For both departments together, that totals 900/2000 men = 45% and 360/1100 women = 32.7%. The overall percentage of women admitted is smaller than that of men, because most of the women applied to Dept B, which is harder to get into.
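
A small Python sketch of the same bookkeeping, using only the numbers in the exercise (the dictionary layout and function name are illustrative choices):

```python
# (applicants, admission rate) per department, taken from the exercise
men = {"A": (1000, 0.60), "B": (1000, 0.30)}
women = {"A": (100, 0.60), "B": (1000, 0.30)}


def overall_rate(groups):
    """Percent admitted across all departments combined."""
    admitted = sum(n * rate for n, rate in groups.values())
    applicants = sum(n for n, _ in groups.values())
    return 100 * admitted / applicants


men_pct = overall_rate(men)
women_pct = overall_rate(women)
print(round(men_pct, 1), round(women_pct, 1))  # 45.0 32.7
```

The rates are equal inside each department, yet the combined rates differ because the two groups are spread differently across the departments.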

#7 p 26
According to a study done at Kaiser Permanente in Walnut Creek, users of oral contraceptives have a higher rate of cervical cancer, even after adjusting for age, education, and marital status. Investigators concluded that the pill causes cervical cancer.
a) This is an observational study.
b) The investigators adjusted for age, education, and marital status because these were potentially confounding variables. The way to adjust is to make comparisons within smaller, more homogeneous groups of people. The risk of cervical cancer increases with age; married and single women tend to differ in sexual activity, and so in their risk of cervical cancer; and the same may hold for women with different levels of education.
c) Women using the pill were likely to differ from non-users on another factor which affects the risk of cervical cancer: sexual activity. Pill users may have more sexual activity and more sexual partners, which increases their likelihood of contracting STDs, which could result in cervical cancer.
d) The conclusions of the study were not justified by the data, because the data showed association, not causation. Also, the confounding variable of sexual activity was not accounted for. The data show only a relationship between the pill and cervical cancer, not that the relationship is causal.

III. THE AVERAGE AND THE HISTOGRAM

i. The histogram balances when supported at the mean (average).
ii. The first histogram below is symmetric about its mean. Half the data is to the left of the mean and half is to the right.
iii. As the red box moves to the right, it pulls the average along with it.
iv. The median of a histogram is the value with half the area to the left and half to the right.
v. LONG RIGHT HAND TAIL - AVERAGE IS BIGGER THAN MEDIAN
vi. LONG LEFT HAND TAIL - AVERAGE IS SMALLER THAN MEDIAN

IV. THE ROOT MEAN SQUARE

i. The root-mean-square (or RMS) of a list of numbers measures the average magnitude (ignoring signs) of the numbers in the list.
ii. The calculation steps are:
   (1) square the entries of the list (SQUARE),
   (2) take the mean of this new list (MEAN), and
   (3) take the square root of this mean (ROOT).
iii. RMS is used to compute the SD (or spread) of a list of numbers.
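
The three steps can be sketched in Python as follows. The function name and the example list are illustrative assumptions; the list is not taken from these notes.

```python
import math


def rms(numbers):
    squares = [x * x for x in numbers]        # SQUARE
    mean_sq = sum(squares) / len(squares)     # MEAN
    return math.sqrt(mean_sq)                 # ROOT


example = [0, 5, -8, 7, -3]   # illustrative list, not from the notes
print(round(rms(example), 2))  # 5.42
```

Note that the steps are applied in the reverse of the order in the name: square first, then mean, then root.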

V. THE STANDARD DEVIATION

i. Standard deviation (SD) measures the spread of the data.
   i. Roughly 68% of the data falls within one SD of the average.
   ii. Roughly 95% of the data falls within two SDs of the average.
ii. Average (mean) and median were measures of the center of the data.
iii. Units of SD and average are the same as those of the data.
iv. Variance is SD^2.
v. The SD says how far away numbers on a list are from their average. Most entries on the list will be somewhere around one SD away from the average. Very few will be more than two or three SDs away.

VI. COMPUTING THE STANDARD DEVIATION

i. Deviation from average = entry - average
ii. SD = rms of the deviations from average
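
A minimal Python sketch of this recipe, applied to the list 13, 9, 11, 7, 10 that these notes use in the standard units example. The function name is an assumption. It divides by the number of entries, as the rms definition does, so it matches Python's statistics.pstdev rather than statistics.stdev (which divides by n - 1).

```python
import math


def sd(numbers):
    avg = sum(numbers) / len(numbers)
    deviations = [x - avg for x in numbers]    # deviation = entry - average
    # rms of the deviations: square, take the mean, take the root
    return math.sqrt(sum(d * d for d in deviations) / len(deviations))


data = [13, 9, 11, 7, 10]
print(sum(data) / len(data), sd(data))  # 10.0 2.0
```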
SUMMARY

1. A typical list of numbers can be summarized by its average and standard deviation.
2. Average of a list = SUM OF ENTRIES / NUMBER OF ENTRIES
3. The average locates the center of a histogram, in the sense that the histogram balances when supported at the average.
4. Half the area under a histogram lies to the left of the median, and half to the right of the median. The median is another way to locate the center of a histogram.
5. The RMS size of a list measures how big the entries are, neglecting signs.
6. RMS = SQUARE ROOT of (average of (entries^2))
7. SD measures distance from average. Each number on a list is off the average by some amount. SD is a sort of average size for these amounts off. SD is the rms size of the deviations from the average.
8. Roughly 68% of the entries on a list of numbers are within one SD of the average, and about 95% are within 2 SDs of the average. This holds for many lists, but not all.
9. If a study draws conclusions about the effects of age, find out whether the data are cross-sectional or longitudinal.

#12 p 27
A study is carried out to determine the effect of party affiliation on voting behavior in a certain city. The city is divided up into wards. In each ward, the percentage of registered Democrats who vote is higher than the percentage of registered Republicans who vote. True or false: for the city as a whole, the percentage of registered Democrats who vote must be higher than the percentage of registered Republicans who vote. If true, why? If false, give an example.

This is false because of Simpson's Paradox: relationships between percentages in subgroups can be reversed when the subgroups are combined. For example, suppose Ward 1 has 10 registered Democrats, of whom 10 voted (100%), and 1000 registered Republicans, of whom 990 voted (99%). The percentage of Democrats who voted is higher in Ward 1.
Suppose Ward 2 has 1000 Democrats, of whom 600 voted (a 60% turnout rate), and 10 Republicans, of whom 5 voted (a 50% turnout rate). The Democratic percentage is again higher in Ward 2. However, totaling Wards 1 and 2 gives 1010 Democrats, of whom 610 voted, or about 60%. Totaling the Republicans gives 1010, of whom 995 voted, or about 98.5%. Thus the Republican turnout is higher for the city as a whole: most of the Republicans live in the high-turnout ward, while most of the Democrats live in the low-turnout ward, even though the Democrats have the higher percentage within each ward.
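
The ward example can be tabulated with a short Python sketch. The numbers come from the answer above; the data layout and function names are illustrative.

```python
# (voted, registered) per ward, from the example in the answer
democrats = {"ward 1": (10, 10), "ward 2": (600, 1000)}
republicans = {"ward 1": (990, 1000), "ward 2": (5, 10)}


def pct(voted, registered):
    return 100 * voted / registered


def citywide(party):
    """Turnout percentage after pooling all wards."""
    voted = sum(v for v, _ in party.values())
    registered = sum(r for _, r in party.values())
    return pct(voted, registered)


for ward in democrats:
    # Democrats lead within every ward...
    print(ward, pct(*democrats[ward]), pct(*republicans[ward]))

# ...but trail once the wards are combined.
print("city", round(citywide(democrats), 1), round(citywide(republicans), 1))  # city 60.4 98.5
```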

#2 p 48
Draw the histograms for the blood pressures of the users and non-users age 17-24. What do you conclude?

[Figure: histogram titled "Non-users of Pill, Age 17-24". Horizontal axis: blood pressure (mm), 85 to 165. Vertical axis: % per mm, 0 to 5.]

I conclude that using the pill is associated with higher blood pressure, by a measurable number of millimeters: the users' histogram has a larger percentage of women at the higher blood pressures than the non-users' histogram.

#7 page 52

Two histograms are sketched below. One shows the distribution of age at death from natural causes (heart disease, cancer, and so forth). The other shows age at death from trauma (accident, murder, suicide). Which is which, and why?
(i), the histogram with a long left hand tail, shows the distribution of age at death from natural causes, because older people are more likely to die of natural causes than younger people: they are at higher risk of heart disease, cancer, and so forth.
(ii), the histogram with a long right hand tail, shows age at death from trauma. Accidents, murders, and suicides are more likely to afflict younger people than older people, so there is a larger number of people towards the left (younger) end of the curve.

#4 page 74

For persons age 25 and over in the US, would the average or the median be higher for income? For years of schooling completed?

For income, the average would be larger than the median. This is because there are people who are extremely rich who stretch out the histogram's tail to the right, giving it a long right hand tail. This affects the average: the balance point lies further to the right to balance out the tail, so the average is bigger than the median.
For years of schooling, we have the opposite. The left hand tail is longer, so the average is smaller than the median. Most people age 25 and older in the US have completed school through 12th grade, so the histogram peaks around there and drops off quickly to the right (there are only so many years of school). A smaller number of people left school earlier, which gives the long left hand tail.

#6 page 75

a.
(i) has an average of 60
(ii) has an average of 50
(iii) has an average of 40

b.
The median is less than the average: (iii)
The median is about equal to the average: (ii)
The median is bigger than the average: (i)

c. The SD of histogram (iii) is around 15. An SD of 50 is too large, since nearly all of the area is within 50 units of the average. An SD of 5 is too small, since only a small portion of the area is within 5 units of the average. By elimination, and from eyeing where most of the area lies, the SD is about 15.

d. False. The SD for histogram (i) is NOT a lot smaller than that for histogram (iii): the histograms are essentially mirror images of one another, so they have the same spread and therefore the same SD.

Extra problem: A list of transactions contains 100 numbers: 60 gains and 40 losses. The gains are positive numbers and the losses are negative numbers. The units are thousands of dollars. For the 60 gains, the average is 18 and the SD is 7.5. For the losses, the average is -20 and the SD is 9.2.

a) Find the average of the 100 transactions.

(18)(60/100) + (-20)(40/100) = 10.8 + (-8) = 2.8

b) Find the SD of the 100 transactions.

S^2 = (60/100)(7.5)^2 + (40/100)(9.2)^2 + (60/100)(18)^2 + (40/100)(-20)^2 - (2.8)^2
    = 33.75 + 33.856 + 194.4 + 160 - 7.84
    = 414.166
S = 20.35
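
The calculation above rests on the identity SD^2 = (average of squares) - (average)^2, applied group by group and then to the whole list. A hedged Python sketch of that reasoning, with variable names chosen for illustration:

```python
import math

# (count, average, SD) for each group, from the problem statement
groups = [(60, 18.0, 7.5), (40, -20.0, 9.2)]
n = sum(count for count, _, _ in groups)

# Overall average: weighted average of the group averages.
avg = sum(count * mean for count, mean, _ in groups) / n

# Within each group, average of squares = SD^2 + average^2;
# weight those by group size to get the average of squares for all 100 numbers.
mean_sq = sum(count * (sd**2 + mean**2) for count, mean, sd in groups) / n

sd_all = math.sqrt(mean_sq - avg**2)
print(round(avg, 2), round(sd_all, 2))  # 2.8 20.35
```

The combined SD is much larger than either group's SD because the two groups sit far apart on either side of zero.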

CHAPTER 5: The Normal Approximation for Data
I. THE NORMAL CURVE
i. The standard normal (or Gaussian) curve is an ideal histogram to which we will compare other data.
ii. The total area under the curve is 1 (or 100%).
iii. The curve is symmetric about the line x = 0.
iv. It has mean = 0 and SD = 1.

II. STANDARD UNITS

i. If x1, . . . , xn is a list of numbers, we convert the values in the list to standard units using the following formula:
ii. zi = (xi - mean) / SD
iii. zi measures how far (in units of SD) xi is from the mean (average) of the list.
iv. Example: Consider the list 13, 9, 11, 7, 10.
> x <- c(13, 9, 11, 7, 10)
> mean(x) = 10
> SD(x) = 2   (the rms deviation from the mean; base R's sd() divides by n - 1 and gives about 2.24)
Now convert 13 to standard units:
> (13 - 10) / 2   -> (NUMBER - AVERAGE) / SD
  = 1.5
v. Method: Convert to standard units, then find the corresponding area under the normal curve.
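
As a rough illustration, the conversion can be applied to every entry of the example list at once. This is a Python sketch with a hypothetical function name; the mean of 10 and SD of 2 are taken from the example.

```python
def standard_units(values, mean, sd):
    # z = (x - mean) / SD for each entry
    return [(x - mean) / sd for x in values]


z = standard_units([13, 9, 11, 7, 10], mean=10, sd=2)
print(z)  # [1.5, -0.5, 0.5, -1.5, 0.0]
```

A list converted to standard units always has average 0 and SD 1, which is what makes it comparable to the normal curve.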

#1 page 82

On a certain exam, the average of the scores was 50 and the SD was 10.
(a) Convert each of the following scores to standard units: 60, 45, 75
60:
zi = (xi - mean) / SD
zi = (60 - 50) / 10
zi = 1

45:
zi = (xi - mean) / SD
zi = (45 - 50) / 10
zi = -0.5

75:
zi = (xi - mean) / SD
zi = (75 - 50) / 10
zi = 2.5

(b) Find the scores which in standard units are 0, +1.5, -2.8
0:
zi = (xi - mean) / SD
0 = (xi - 50) / 10
xi = 50

+1.5:
zi = (xi - mean) / SD
1.5 = (xi - 50) / 10
xi = 65

-2.8:
zi = (xi - mean) / SD
-2.8 = (xi - 50) / 10
xi = 22
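
Part (b) runs the formula backwards: solving z = (x - mean) / SD for x gives x = mean + z * SD. A small Python sketch, with an illustrative function name:

```python
def from_standard_units(z, mean, sd):
    # Solve z = (x - mean) / sd for x.
    return mean + z * sd


scores = [from_standard_units(z, mean=50, sd=10) for z in (0.0, 1.5, -2.8)]
print([round(s, 1) for s in scores])  # [50.0, 65.0, 22.0]
```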

III. FINDING AREAS UNDER THE NORMAL CURVE

i. Use one or more of the following to find areas under the normal curve:
   i. Total area under the curve is 1 (that is, 100%)
   ii. The curve is symmetric about the vertical line x = 0
   iii. (area to the left of x) = 1 - (area to the right of x)
   iv. The 68%, 95%, 99.7% rules (see slide 2)
   v. The pnorm or qnorm functions in R
   vi. Normal table in Appendix A of your text

#1a,b,c page 84

Find the area under the normal curve
(a) to the right of 1.25
.5 (100% - 78.87%) = 11%
(b) to the left of -0.40
.5 (100% - 31.08%) = 34%
(c) to the left of 0.80
50% + .5 (57.63%) = 79%
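
These table lookups can be cross-checked without a table. The sketch below uses Python's math.erf to compute areas under the standard normal curve; the function names are made up here, and this is a substitute for the Appendix A table or R's pnorm, not the method used in the notes.

```python
import math


def area_left_of(z):
    # Standard normal area to the left of z, via the error function.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def area_between(z):
    # Area between -z and +z, the quantity the table entries above refer to.
    return area_left_of(z) - area_left_of(-z)


print(round(100 * (1 - area_left_of(1.25)), 1))  # 10.6  (a) right of 1.25
print(round(100 * area_left_of(-0.40), 1))       # 34.5  (b) left of -0.40
print(round(100 * area_left_of(0.80), 1))        # 78.8  (c) left of 0.80
```

The results agree with the answers above once those are rounded to whole percents.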

IV. THE NORMAL APPROXIMATION FOR DATA

i. For the normal curve,
   a. 68% of the area is within 1 SD of the mean
   b. 95% of the area is within 2 SDs of the mean
   c. 99.7% of the area is within 3 SDs of the mean
ii. The same area vs. SD rule roughly holds for histograms generated by many other data sets.
iii. From another perspective, for many data sets, if we convert the data to standard units, the histogram will look a lot like the normal curve.
iv. Example:

V. PERCENTILES

i. For data that do not follow the normal curve, we use percentiles, quartiles and related descriptive statistics to summarize the data.
ii. The 25th percentile is a number x for which 25% of the data is less than x.
iii. The 50th percentile is a number x for which 50% of the data is less than x (this is the same as the median).
iv. The 75th percentile is a number x for which 75% of the data is less than x.
v. The 1st percentile is the level that 1% of people fall below and 99% are above.
vi. The 10th percentile is the level that 10% of people fall below and 90% are above.
vii. INTERQUARTILE RANGE = 75th percentile - 25th percentile
viii. Percentiles are used for distributions with long tails.

VI. PERCENTILES AND THE NORMAL CURVE

i.

#1 page 93 review exercises (in class we did 1.33 SD)

The following list of test scores has an average of 50 and an SD of 10:

(a) Use the normal approximation to estimate the number of scores within 1.25 SDs of the average.
First, use the normal table to find the area for 1.25: about 79%. This is the area between -1.25 and 1.25. Multiplying 79% by the number of entries on the given list (25) gives 25 x .79, which is about 20. Thus, I estimate that about 20 scores are within 1.25 SDs of the average.

(b) How many scores really were within 1.25 SDs of the average?
To count the scores actually within 1.25 SDs of the average, convert 1.25 SDs back to the units of the list. Because the SD is 10, 1.25 SDs is 12.5. Adding 12.5 to the average (50) and subtracting 12.5 from it gives 62.5 and 37.5. Counting the numbers between 37.5 and 62.5 on the list gives 18 scores.
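
Both parts can be sketched in Python. The list of scores itself is not reproduced in these notes, so the sketch only computes the estimate for part (a) and the cutoffs for part (b); math.erf stands in for the normal table, and the variable names are illustrative.

```python
import math


def area_within(z):
    # Area under the standard normal curve between -z and +z.
    return math.erf(z / math.sqrt(2))


avg, sd, n_scores, k = 50, 10, 25, 1.25

estimate = n_scores * area_within(k)        # part (a): normal approximation
low, high = avg - k * sd, avg + k * sd      # part (b): cutoffs to count between
print(round(estimate, 1), low, high)  # 19.7 37.5 62.5
```

With the actual list in hand, part (b) would be a count of the entries between low and high.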

#11 page 95

One term, about 700 Stats 2 students at UC Berkeley were asked how many college math courses they took, other than Stats 2. The average number of courses was about 1.1; the SD was about 1.5. Would the histogram for the data look like (i), (ii), or (iii)? Why?
The histogram would look like (i), because you can't have a negative number of courses, and that eliminates (ii) and (iii). Histogram (ii) shows some students taking fewer than 0 courses (one SD below the average is 1.1 - 1.5 = -0.4), and histogram (iii) shows that as well, but to a greater extreme. Though it is possible to take fewer courses than the average, say 0 courses, it's not possible to have taken -0.4 courses.

#5 page 65

For registered students at universities in the US, which is larger: average age or median age?
Average age is larger, because the histogram of ages for registered students at US universities would have a long right hand tail (lots of young people going to school, and fewer older people, though some older people do go to school).

MEASUREMENT ERROR: Chapter 6

I. Introduction
i. In the real world, if we measure something several times, we observe different values each time.
ii. Each result is thrown off by chance error.
iii. How do these errors arise?
iv. How big are they likely to be?

II. Chance Error

i. Standard weights are maintained at local, state, national and international levels for commercial, scientific and other purposes.
ii. The International Bureau of Weights and Measures near Paris maintains the International Prototype Kilogram.
iii. The National Bureau of Standards in Washington, D.C. maintains a national prototype kilogram (Kilogram #20) that is calibrated against the international standard.
iv. The Bureau maintains several other standard weights that are calibrated against Kilogram #20.
v. NB 10 is one such standard weight: a 10 gram weight maintained by the National Bureau of Standards. It weighs very nearly 10 grams.
   i. The first five NB 10 measurements (in grams) from Table 1 on page 99 are:
   ii. 9.999591 9.999600 9.999594 9.999601 9.999598
   iii. Expressed in micrograms below 10 grams, these measurements are:
   iv. 409 400 406 399 402
   v. For the measurements in Table 1, the mean is about 405 micrograms below 10 grams and the SD is about 6 micrograms.

III. Bias

i. Bias affects all measurements the same way, pushing them in the same direction.
ii. Chance errors change from measurement to measurement, sometimes up and sometimes down.
iii. (individual measurement) = (exact value) + bias + (chance error)
iv. If there is no bias, the long-run average of repeated measurements should approach the exact value. Chance errors should cancel out in the average.

Examples:

A carpenter is using a tape measure to get the length of a board.

(a) What are some possible sources of bias?
   Answer: Possible sources of bias include stretching a cloth tape, deliberately and to a significant extent, and manufacturing errors in the tape measure itself.
(b) Which is more subject to bias, a steel tape or a cloth tape?
   Answer: A cloth tape is more subject to bias, because it can stretch over time or shrink (a systematic error, or bias), making its measurements less reliable. A steel tape can expand with temperature, but only at really high temperatures, so by comparison the cloth tape is more likely to be biased.
(c) Would the bias in a cloth tape change over time?
   Answer: Yes - continued use of the cloth tape leads to stretching, thus increasing the bias.

True or false, and explain.

(a) Bias is a kind of chance error.
   Answer: False - bias is not a form of chance error, because bias is a systematic and predictable error.
(b) Chance error is a kind of bias.
   Answer: False - chance error is not a kind of bias, because chance errors change with every measurement you make, while bias shifts every measurement in the same direction.
(c) Measurements are usually affected by both bias and chance error.
   Answer: True. It is not really feasible to tell whether there is bias simply by looking at the measurements themselves, so an outside comparison point is necessary to figure out whether they are biased.
