78 Journal of The Association of Physicians of India ■ Vol. 65 ■ March 2017

Statistics for Researchers

Principles of Correlation Analysis


NJ Gogtay, UM Thatte

Department of Clinical Pharmacology, Seth GS Medical College & KEM Hospital, Mumbai, Maharashtra
Received: 04.12.2016; Accepted: 10.12.2016

Introduction

The field of medicine often requires drawing inferences regarding the association or relationship between two or more variables. In an earlier article on “Measures of Association” we introduced the concept of finding associations [relationships] between two variables that were binary and categorical in nature.¹ Therein, we explored several possible relationships between these binary variables and understood metrics such as absolute risk, relative risk and odds ratio.

In the present article, we discuss how to establish a relationship or an association between two quantitative variables, i.e., variables that can be “measured”.² As an example, we could perhaps ask the question “Is there a relationship between the number of hours of work put in by a sales representative and the actual sales of a product?” Or “Is there a relationship between maternal age [measured in years] and parity [total number of pregnancies that a woman has carried past 20 weeks of pregnancy]?” Correlation analysis helps answer questions such as these.

Definition of Correlation, its Assumptions and the Correlation Coefficient

Correlation, also called correlation analysis, is a term used to denote the association or relationship between two (or more) quantitative variables. This analysis is fundamentally based on the assumption of a straight-line [linear] relationship between the quantitative variables. Similar to the measures of association for binary variables, it measures the “strength” or the “extent” of an association between the variables and also its direction.

The end result of a correlation analysis is a correlation coefficient whose values range from -1 to +1. A correlation coefficient of +1 indicates that the two variables are perfectly related in a positive [linear] manner, a correlation coefficient of -1 indicates that two variables are perfectly related in a negative [linear] manner, while a correlation coefficient of zero indicates that there is no linear relationship between the two variables being studied. These are depicted in Figures 1 and 2.

[Figure 1: three scatter plots. (1a) Positive correlation, r = 0.4; (1b) No correlation, r = 0; (1c) Negative correlation, r = -0.4]
Fig. 1: Scatter Plot showing Correlation between two variables. Note: Fig. 1a shows a weak positive correlation, Fig. 1b shows no correlation and Fig. 1c shows a weak negative correlation

Eyeballing and Analyzing the Data for Correlation - Construction of the Scatter Plot/Scatter Diagram

A correlation analysis begins with the construction of a scatter plot or scatter diagram [a graphical representation of the data] with one variable on the X-axis and the other on the Y-axis. Let us understand this with an example.

We had carried out a study³ earlier that evaluated whether two modalities of the informed consent process, the written informed consent process and the audio visual [AV] recording of this (in the same clinical trial), were different from each other in terms of the extent of understanding of the study by the participant using a pre-validated questionnaire. This questionnaire gave a “total score” [a quantitative measure] at the end of administration. One of the study objectives was to see if there was a relationship between the time (in minutes) taken to administer the consent in the two groups [again a quantitative measure] and the total score. Table 1 gives data on individual participants in both groups for time taken to consent [measured in minutes] and the total
score obtained by the participant [presented as a number].

Table 1

Participant   Group 1: Written informed   Time to administer   Group 2: AV consent   Time to administer
number        consent [WIC]               WIC [minutes]        Total score [n=17]    AV consent [minutes]
              Total score [n=21]
 1            30                          73                   44                     75
 2            29                          28                   37                     42
 3            42                          25                   44                     20
 4            40                          30                   42                     20
 5            40                          30                   46                     55
 6            43                          35                   38                     90
 7            29                          50                   43                     30
 8            36                          55                   38                     73
 9            38                          55                   46                    104
10            43                          60                   45                     81
11            46                          68                   44                    149
12            41                          55                   42                     60
13            34                          80                   41                     58
14            35                          85                   39                     54
15            27                          35                   44                    120
16            21                          35                   37                     60
17            19                          30                   38                     80
18            43                          60
19            23                          58
20            45                          80
21            27                          70

[Figure 2: “Correlation Coefficient Shows Strength & Direction of Correlation”. A scale running from -1.0 through -0.5, 0.0 and +0.5 to +1.0, labelled Strong, Weak, Weak, Strong from left to right, with Negative Correlation on the left, Zero in the centre and Positive Correlation on the right]
Fig. 2: The spectrum of the correlation coefficient (-1 to +1)

The scatter plot or scatter diagram of the total score on the Y axis with the time taken to administer consent on the X axis enables us to get a feel of the relationship (if any) between the two. Each point on the scatter plot represents the values of X and Y as a single coordinate. The closer the points are to a straight line, the stronger is the linear relationship between two variables. Two scatter plots, one for each group, can be easily constructed using Microsoft Excel and those for our example are shown below.

[Scatter plot for Group 1 (Written Informed Consent). X-axis: Time to consent (mins); Y-axis: Total score. Fitted line: y = 0.4639x + 32.582, R² = 0.03549]
Scatter plot 1: Written informed consent [Total score vs. time to administer consent]

[Scatter plot for Group 2 (AV Consent). X-axis: Time to consent (mins); Y-axis: Total score. Fitted line: y = 0.7852x + 36.892, R² = 0.02265]
Scatter plot 2: AV consent group [Total score vs. time to administer consent]

Both scatter plots from our study show a weak, positive, linear relationship between the total scores and the time taken to administer the consent.

The advantage of the scatter plot is that it is simple to construct, is non-mathematical in nature and is unaffected by any extreme values that may be present in the data set. It also tells us immediately if there are outliers or if the relationship is actually non-linear or not entirely linear. A line is usually drawn through the points on a scatter plot to identify linearity in the relationship. This line is called the regression line or the least squares line, because it is determined such that the sum of the squared distances of all the data points from the line is the lowest possible. This will be discussed in greater detail in the next article on regression analysis.

The disadvantage of a scatter plot is that it does not give us one single value that will help us to understand whether or not there is a correlation between the variables
being studied and hence we need to go a step ahead now to calculate a correlation coefficient.

Calculating the Correlation Coefficients - Karl Pearson’s Correlation Coefficient r and Spearman’s Correlation Coefficient rho (ρ)

A correlation coefficient is that single value or number which quantifies the relationship between the two variables being studied. Two methods are used to calculate this value, viz. the Karl Pearson’s product moment correlation coefficient r, or more simply Karl Pearson’s correlation coefficient r, and the Spearman’s rank correlation coefficient rho (ρ), or Spearman’s rho (ρ) in short.

The Pearson’s correlation coefficient establishes a relationship between the two variables based on three assumptions. These are-
a. Relationship is linear
b. Observations are independent of each other
c. Variables are normally distributed.⁴

On the other hand, the Spearman’s rho (ρ) is based on the ranks given to the observations and not on their actual values and is used when the assumptions of the Pearson’s coefficient are not met. It can thus be considered as the non-parametric equivalent of the Pearson’s coefficient. This is a robust coefficient and can also be used when one of the variables is ordinal⁴ in nature. For example, if you want to find the relationship between the weight (measured in kg, continuous, quantitative data) and socioeconomic stratum (ordinal data: higher, middle, lower, etc.) the Spearman rho (ρ) could be used.

Normality, we know from an earlier article on distributions, is commonly tested using the Kolmogorov Smirnov test.⁵ In this example, when the variables in the two groups were tested for normality and were found not to follow a normal distribution, we calculated the Spearman’s rho (ρ). The ρ value obtained in our study for the written informed consent group was 0.2 while that for the AV consent group was 0.15.

Figure 2 describes the interpretation of this correlation coefficient and places the relationship in perspective. In our case, the values of 0.2 and 0.15 indicate a weak positive correlation between the two variables, interpreted to mean that the time taken to administer consent is weakly, though positively, related to the understanding of consent as assessed by the total scores.

When the relationship or association between more than two quantitative variables is to be studied, other correlation coefficients such as the Sample Multiple Correlation Coefficient can be used.

What Correlation Coefficients do NOT do

Correlation coefficients do not give information about whether one variable moves in response to another. There is no attempt to establish one variable as “dependent” and the other as “independent”. We shall discuss the concept of independent and dependent variables in the next article on regression analysis. Relationships identified using correlation coefficients should be interpreted for what they are: associations, and not causal relationships (see below).

Testing for Significance after Calculating the Correlation Coefficients

Any relationship or association between two variables should be assessed not just for the strength and direction [as given by the correlation coefficients r or ρ], but also by whether the relationship is “significant” [given by the p value]. Hence testing for significance answers the question “how reliable is the correlation analysis?”

When we calculate correlation coefficients from the given data, what we calculate really are the sample correlation coefficients. We now need to apply “tests of significance”⁶ to see how close these sample correlation coefficients are to the true population value; i.e., the population correlation coefficients. Both the p values obtained in our study were > 0.05, indicating a lack of a significant relationship between the time taken to administer consent and the total score. It is important to remember here that if the sample size is sufficiently large, even small correlation coefficients will achieve statistical significance without being clinically meaningful.

Coefficient of Determination - r² [r squared]

This is the square of the coefficient of correlation, r², which is calculated by squaring the value of the “r” obtained. In our study, this would be 0.2 x 0.2 = 0.04 or 4% for the written informed consent group and 0.15 x 0.15 = 0.0225 or about 2% for the AV consent group. This would mean that only 4% and 2% of the variability respectively in the total score can be accounted for by the time taken to administer the consent.

Correlation and Causation

One common error that often occurs is confusing correlation with causation. All that correlation shows is that the two variables are associated and nothing more. Any judgment regarding cause and effect must be made on the basis of the investigator’s knowledge and biological plausibility. This is easily seen in an interesting study by Messerli FH⁷ who showed that
the greater a country’s annual per capita chocolate consumption, the greater the number of Nobel Laureates per 10 million population, and thus established a “relationship” or “association” between chocolate consumption and getting a Nobel prize!

Factors that Affect a Correlation Analysis

Several factors must be considered when a correlation analysis is planned. These include:
i. Correlation analysis should not be used when the data are repeated measures of the same variable from the same individual at the same or varied time points. For example, if you have measured pain scores in patients with rheumatoid arthritis at monthly intervals over 2 years in a study, it is inappropriate to calculate a correlation coefficient for these data.
ii. It is useful to draw a scatter plot as an important pre-requisite to any correlation analysis as it helps eyeball the data for outliers, non-linear relationships and heteroscedasticity.
iii. An outlier is essentially an infrequently occurring value in the data set. It is important to remember that even a single outlier can dramatically alter the correlation coefficient.
iv. If there is a non-linear relationship between the quantitative variables, correlation analysis should not be performed. For example, during the growth phase in adolescence, there would be a linear relationship between height and weight, as both increase. However, this relationship ceases once a person enters adulthood.
v. If the dataset has two distinct subgroups of individuals whose values for one or both variables differ considerably from each other, a false correlation may be found, when none may exist. An example given by Aggarwal and Ranganathan⁸ illustrates this point well. If you were to plot heights (on X-axis) and hemoglobin levels (on Y-axis) of a group of men (n=20) and women (n=20), most women may end up in the left lower corner (shorter and lower hemoglobin) and most men in the right upper corner (taller and higher hemoglobin). Analysis would suggest a relationship with a positive “r” value between height and hemoglobin levels!
vi. The sample size should be appropriately calculated a priori.⁹ Small sample sizes may show a false positive relationship.
vii. If one data set forms part of the second data set, for example, height at age 12 (X-axis) and height at age 30 (Y-axis), we would expect to find a positive correlation between them because the second quantity “contains” the first quantity.
viii. Heteroscedasticity is a situation in which one variable has unequal variability across the range of values of the second variable. For instance, if one were to plot time on the X-axis and the Sensex on the Y-axis, one would find a great variability in the Sensex as compared to the relative stability in time.

Conclusion

In summary, correlation coefficients are used to assess the strength and direction of the linear relationships between pairs of continuous variables. When both variables are normally distributed we use Pearson’s correlation coefficient “r”. Otherwise, we use Spearman’s correlation coefficient rho (ρ), which is non-parametric in nature, and is more robust to outliers than is the Pearson’s correlation coefficient “r”.

Correlation analysis is seldom used alone and is usually accompanied by regression analysis. The difference between correlation and regression lies in the fact that while a correlation analysis stops with the calculation of the correlation coefficient and perhaps a test of significance, a regression analysis goes ahead to express the relationship in the form of an equation and moves into the realm of prediction. The next article in the series will deal with regression analysis.

References

1. Gogtay NJ, Deshpande S, Thatte UM. Measures of Association. J Assoc Phy Ind 2016; [in press]
2. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-65.
3. Figer BH, Chaturvedi M, Thaker SJ, Gogtay NJ, Thatte UM. A comparative study of the informed consent process with or without audio-visual recording. Nat Med J Ind 2017; in press.
4. Deshpande S, Gogtay NJ, Thatte UM. Data types. J Assoc Phy Ind 2016; 64:64-5.
5. Gogtay NJ, Deshpande S, Thatte UM. Normal distributions, p values and confidence intervals. J Assoc Phy Ind 2016; 64:74-6.
6. Deshpande S, Gogtay NJ, Thatte UM. Which test where? J Assoc Phy Ind 2016; 64:64-66.
7. Messerli FH. Chocolate consumption, cognitive function and Nobel Laureates. N Engl J Med 2012; 367:1562-4.
8. Aggarwal R, Ranganathan P. Common pitfalls in statistical analysis: The use of correlation techniques. Perspect Clin Res 2016; 7:187-90.
9. Gogtay NJ, Thatte UM. Samples and their sizes- the bane of researchers. J Assoc Phy Ind 2016; 64:68-71.
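
As a companion to the discussion of Pearson’s r, Spearman’s rho (ρ) and the coefficient of determination r², the following is a minimal Python sketch of how these three quantities could be computed for the Table 1 data. It is an illustration only and not the authors’ procedure (the article mentions Microsoft Excel for the scatter plots and does not name the software used for the coefficients). The variable and function names (wic_score, pearson_r, average_ranks, spearman_rho) are hypothetical, the columns are assigned as they are labelled in Table 1, and tied values are given the mean of their ranks, which is an assumption about how ties are handled. The printed values need not reproduce the published figures exactly.

```python
import math

# Table 1 data, with columns assigned as labelled in the table (an assumption).
# Group 1: written informed consent (n=21)
wic_score = [30, 29, 42, 40, 40, 43, 29, 36, 38, 43, 46,
             41, 34, 35, 27, 21, 19, 43, 23, 45, 27]
wic_time = [73, 28, 25, 30, 30, 35, 50, 55, 55, 60, 68,
            55, 80, 85, 35, 35, 30, 60, 58, 80, 70]
# Group 2: AV consent (n=17)
av_score = [44, 37, 44, 42, 46, 38, 43, 38, 46, 45, 44,
            42, 41, 39, 44, 37, 38]
av_time = [75, 42, 20, 20, 55, 90, 30, 73, 104, 81, 149,
           60, 58, 54, 120, 60, 80]


def pearson_r(x, y):
    """Pearson's product moment correlation coefficient r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson's r computed on the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))


for label, time, score in [("Written informed consent", wic_time, wic_score),
                           ("AV consent", av_time, av_score)]:
    r = pearson_r(time, score)
    rho = spearman_rho(time, score)
    # r squared is the coefficient of determination.
    print(f"{label}: r = {r:.3f}, r^2 = {r * r:.3f}, rho = {rho:.3f}")
```

Because a correlation coefficient is symmetric, it does not matter which variable is treated as X and which as Y here. The sketch stops short of a test of significance; where SciPy is available, routines such as scipy.stats.pearsonr and scipy.stats.spearmanr return a p value alongside the coefficient.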
