Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai, Maharashtra
Received: 04.12.2016; Accepted: 10.12.2016

Definition of Correlation, its Assumptions and the Correlation Coefficient

Fig. 1: Scatter plot showing correlation between two variables. Panels: Positive Correlation (1a), No correlation (1b), Negative (1c). Note: Fig. 1a shows a weak positive correlation, Fig. 1b shows no correlation and Fig. 1c shows a weak negative correlation.

Correlation, also called correlation analysis, is a term used to denote the association or relationship between two (or more) quantitative variables. This analysis is fundamentally based on the assumption of a straight-line
Journal of The Association of Physicians of India ■ Vol. 65 ■ March 2017 79
score obtained by the participant [presented as a number].

The scatter plot or scatter diagram of the total score on the Y axis with the time taken to administer consent on the X axis enables us to get a feel of the relationship (if any) between the two. Each point on the scatter plot represents the values of X and Y as a single coordinate. The closer the points are to a straight line, the stronger is the linear relationship between the two variables. Two scatter plots, one for each group, can be easily constructed using Microsoft Excel, and those for our example are shown below.

Table 1 (WIC = written informed consent)

Participant  Group 1: WIC      Time to administer  Group 2: AV consent  Time to administer
Number       Total score       WIC [minutes]       Total score          AV consent [minutes]
             [n=21]                                [n=17]
1            30                73                  44                   75
2            29                28                  37                   42
3            42                25                  44                   20
4            40                30                  42                   20
5            40                30                  46                   55
6            43                35                  38                   90
7            29                50                  43                   30
8            36                55                  38                   73
9            38                55                  46                   104
10           43                60                  45                   81
11           46                68                  44                   149
12           41                55                  42                   60
13           34                80                  41                   58
14           35                85                  39                   54
15           27                35                  44                   120
16           21                35                  37                   60
17           19                30                  38                   80
18           43                60
19           23                58
20           45                80
21           27                70

Scatter plot 1: Written informed consent group (Group 1) [Total score vs. time to administer consent]. Axis labels: Total score; Time to consent (mins). Fitted line shown on the plot: y = 0.4639x + 32.582, R² = 0.03549.
Scatter plot 2: AV consent group (Group 2) [Total score vs. time to administer consent]. Axis labels: Total score; Time to consent (mins). Fitted line shown on the plot: y = 0.7852x + 36.892, R² = 0.02265.

Both scatter plots from our study show a weak, positive, linear relationship between the total scores and the time taken to administer the consent.

The advantage of the scatter plot is that it is simple to construct, is non-mathematical in nature and is unaffected by any extreme values that may be present in the data set. It also tells us immediately if there are outliers or if the relationship is actually non-linear or not entirely linear. A line is usually drawn through the points on a scatter plot to identify linearity in the relationship. This line is called the regression line or the least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. This will be discussed in greater detail in the next article on regression analysis.

Fig. 2: The spectrum of the correlation coefficient (-1 to +1). The correlation coefficient shows the strength and direction of correlation on a scale marked -1.0, -0.5, 0.0, +0.5 and +1.0: strong negative correlation towards -1.0, weak correlation on either side of zero, and strong positive correlation towards +1.0.

The disadvantage of a scatter plot is that it does not give us one single value that will help us to understand whether or not there is a correlation between the variables
being studied and hence we need to go a step further and calculate a correlation coefficient.

Calculating the Correlation Coefficients - Karl Pearson’s Correlation Coefficient r and Spearman’s Correlation Coefficient rho (ρ)

A correlation coefficient is that single value or number which establishes a relationship between the two variables being studied. Two methods are used to calculate this value, viz. Karl Pearson’s product moment correlation coefficient r, or more simply Karl Pearson’s correlation coefficient r, and Spearman’s rank correlation coefficient rho (ρ), or Spearman’s rho (ρ) in short.

Pearson’s correlation coefficient establishes a relationship between the two variables based on three assumptions. These are:
a. Relationship is linear
b. Variables are independent of each other
c. Variables are normally distributed.⁴

On the other hand, Spearman’s rho (ρ) is based on the ranks given to the observations and not on their actual values, and is used when the assumptions of Pearson’s coefficient are not met. It can thus be considered the non-parametric equivalent of Pearson’s coefficient. This is a robust coefficient and can also be used when one of the variables is ordinal⁴ in nature. For example, if you want to find the relationship between weight (measured in kg; continuous, quantitative data) and socioeconomic stratum (ordinal data – higher, middle, lower, etc.), Spearman’s rho (ρ) could be used.

Normality, as we know from an earlier article on distributions, is commonly tested using the Kolmogorov-Smirnov test.⁵ In this example, when the variables in the two groups were tested for normality and were found not to follow a normal distribution, we calculated Spearman’s rho (ρ). The ρ value obtained in our study for the written informed consent group was 0.2, while that for the AV consent group was 0.15.

Figure 2 describes the interpretation of this correlation coefficient and places the relationship in perspective. In our case, the values of 0.2 and 0.15 indicate a weak positive correlation between the two variables, interpreted to mean that the time taken to administer consent is weakly, though positively, related to the understanding of consent as assessed by the total scores.

When the relationship or association between more than two quantitative variables is to be studied, other correlation coefficients such as the Sample Multiple Correlation Coefficient can be used.

What Correlation Coefficients do NOT do

Correlation coefficients do not give information about whether one variable moves in response to another. There is no attempt to establish one variable as “dependent” and the other as “independent”. We shall discuss the concept of independent and dependent variables in the next article on regression analysis. Relationships identified using correlation coefficients should be interpreted for what they are: associations, and not causal relationships (see below).

Testing for Significance after Calculating the Correlation Coefficients

Any relationship or association between two variables should be assessed not just for its strength and direction [as given by the correlation coefficients r or ρ], but also by whether the relationship is “significant” [given by the p value]. Hence testing for significance answers the question “how reliable is the correlation analysis?”

When we calculate correlation coefficients from the given data, what we really calculate are the sample correlation coefficients. We now need to apply “tests of significance”⁶ to see how close these sample correlation coefficients are to the true population values, i.e., the population correlation coefficients. Both the p values obtained in our study were > 0.05, indicating a lack of a significant relationship between the time taken to administer consent and the total score. It is important to remember here that if the sample size is sufficiently large, even small correlation coefficients will achieve statistical significance without being clinically meaningful.

Coefficient of Determination – r² [r square]

This is the square of the correlation coefficient, r², which is calculated by squaring the value of the “r” obtained. In our study, this would be 0.2 × 0.2 = 0.04 or 4% for the written informed consent group and 0.15 × 0.15 ≈ 0.02 or 2% for the AV consent group. This would mean that only 4% and 2% of the variability, respectively, in the total score can be accounted for by the time taken to administer the consent.

Correlation and Causation

One common error that often occurs is confusing correlation with causation. All that correlation shows is that the two variables are associated and nothing more. Any judgment regarding cause and effect must be made on the basis of the investigator’s knowledge and biological plausibility. This is easily seen in an interesting study by Messerli FH⁷ who showed that
the greater a country’s annual per capita chocolate consumption, the higher the number of Nobel Laureates per 10 million population, and thus established a “relationship” or “association” between chocolate consumption and getting a Nobel prize!

Factors that Affect a Correlation Analysis

Several factors must be considered when a correlation analysis is planned. These include:
i. Correlation analysis should not be used when the data are repeated measures of the same variable from the same individual at the same or varied time points. For example, if you have measured pain scores in patients with rheumatoid arthritis at monthly intervals over 2 years in a study, it is inappropriate to calculate a correlation coefficient for these data.
ii. It is useful to draw a scatter plot as an important prerequisite to any correlation analysis, as it helps eyeball the data for outliers, non-linear relationships and heteroscedasticity.
iii. An outlier is essentially an infrequently occurring value in the data set. It is important to remember that even a single outlier can dramatically alter the correlation coefficient.
iv. If there is a non-linear relationship between the quantitative variables, correlation analysis should not be performed. For example, during the growth phase in adolescence, there would be a linear relationship between height and weight, as both increase. However, this relationship ceases once a person enters adulthood.
v. If the dataset has two distinct subgroups of individuals whose values for one or both variables differ considerably from each other, a false correlation may be found when none may exist. An example given by Aggarwal and Ranganathan⁸ illustrates this point well. If you were to plot heights (on the X-axis) and hemoglobin levels (on the Y-axis) of a group of men (n=20) and women (n=20), most women may end up in the left lower corner (shorter and lower hemoglobin) and most men in the right upper corner (taller and higher hemoglobin). Analysis would suggest a relationship with a positive “r” value between height and hemoglobin levels!
vi. The sample size should be appropriately calculated a priori.⁹ Small sample sizes may show a false positive relationship.
vii. If one data set forms part of the second data set, for example, height at age 12 (X-axis) and height at age 30 (Y-axis), we would expect to find a positive correlation between them because the second quantity “contains” the first quantity.
viii. Heteroscedasticity is a situation in which one variable has unequal variability across the range of values of the second variable. For instance, if one were to plot time on the X-axis and the Sensex on the Y-axis, one would find a great variability in the Sensex as compared to the relative stability in time.

Conclusion

In summary, correlation coefficients are used to assess the strength and direction of the linear relationships between pairs of continuous variables. When both variables are normally distributed we use Pearson’s correlation coefficient “r”. Otherwise, we use Spearman’s correlation coefficient rho (ρ), which is non-parametric in nature and is more robust to outliers than Pearson’s correlation coefficient “r”.

Correlation analysis is seldom used alone and is usually accompanied by regression analysis. The difference between correlation and regression lies in the fact that while a correlation analysis stops with the calculation of the correlation coefficient and perhaps a test of significance, a regression analysis goes on to express the relationship in the form of an equation and moves into the realm of prediction. The next article in the series will deal with regression analysis.

References
1. Gogtay NJ, Deshpande S, Thatte UM. Measures of Association. J Assoc Phy Ind 2016; [in press].
2. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-65.
3. Figer BH, Chaturvedi M, Thaker SJ, Gogtay NJ, Thatte UM. A comparative study of the informed consent process with or without audio-visual recording. Nat Med J Ind 2017; in press.
4. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-5.
5. Gogtay NJ, Deshpande S, Thatte UM. Normal distributions, p values and confidence intervals. J Assoc Phy Ind 2016; 64:74-6.
6. Deshpande S, Gogtay NJ, Thatte UM. Which test where? J Assoc Phy Ind 2016; 64:64-66.
7. Messerli FH. Chocolate consumption, cognitive function and Nobel Laureates. N Engl J Med 2012; 367:1562-4.
8. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res 2016; 7:187-90.
9. Gogtay NJ, Thatte UM. Samples and their sizes- the bane of researchers. J Assoc Phy Ind 2016; 64:68-71.
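
As a hedged illustration of the calculations described in this article, the following Python sketch computes Pearson's r, Spearman's rho (ρ) and the squared coefficient for the two consent groups, using the total scores and times as they appear in Table 1. The function names (`pearson_r`, `average_ranks`, `spearman_rho`, `permutation_p_value`) are illustrative and do not come from the article. The permutation test is a stand-in for the "test of significance", since the article does not state which test produced its p values, and the printed numbers are not claimed to reproduce the published coefficients.

```python
import math
import random

# Table 1, Group 1 (written informed consent, n=21): total score and
# time to administer consent (minutes), in participant order.
wic_score = [30, 29, 42, 40, 40, 43, 29, 36, 38, 43, 46,
             41, 34, 35, 27, 21, 19, 43, 23, 45, 27]
wic_time = [73, 28, 25, 30, 30, 35, 50, 55, 55, 60, 68,
            55, 80, 85, 35, 35, 30, 60, 58, 80, 70]

# Table 1, Group 2 (AV consent, n=17).
av_score = [44, 37, 44, 42, 46, 38, 43, 38, 46, 45, 44,
            42, 41, 39, 44, 37, 38]
av_time = [75, 42, 20, 20, 55, 90, 30, 73, 104, 81, 149,
           60, 58, 54, 120, 60, 80]


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient r."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)  # fails if either variable is constant


def average_ranks(values):
    """Rank values 1..n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1  # positions are 0-based
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r applied to the ranks."""
    return pearson_r(average_ranks(x), average_ranks(y))


def permutation_p_value(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p value for rho (assumed method, see lead-in)."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


for label, time, score in [("Written informed consent", wic_time, wic_score),
                           ("AV consent", av_time, av_score)]:
    rho = spearman_rho(time, score)
    print(f"{label}: n={len(score)}, "
          f"Pearson r={pearson_r(time, score):.3f}, "
          f"Spearman rho={rho:.3f}, rho squared={rho ** 2:.3f}, "
          f"permutation p={permutation_p_value(time, score):.3f}")
```

Computing Spearman's rho as Pearson's r on average ranks keeps the handling of tied scores explicit, which matters here because several participants share the same total score.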