CHAPTER 1
INTRODUCTION
Many scientific studies are characterized by the fact that numerous variables are used to
describe the objects under study [1]. Examples are studies in which questionnaires consisting of
many questions (variables) are used, and studies in which mental ability is tested via several
subtests, such as skills tests, logical reasoning tests, etc. [2]. Because of the large number of
variables in play, the study can become rather complicated in the sense that, as more and more
variables are added, more and more of them overlap in what they measure. For situations such as
these, Exploratory Factor Analysis (EFA) was developed. Broadly speaking, factor analysis
provides the tools for analyzing the structure of interrelationships among a large number of
variables by defining sets of variables that are highly interrelated, known as factors, which are
assumed to represent dimensions within the data and to partially or completely replace the
original set of variables [5].
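To make the factor idea concrete, the following is a minimal numpy sketch of variable reduction via the principal-component method of factor extraction, one common EFA extraction step. The synthetic six-variable data set and the choice of two factors are illustrative assumptions, not data from the study described here.

```python
# Sketch: extract factor loadings from a correlation matrix by the
# principal-component method (synthetic two-factor data; all names and
# dimensions below are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))                    # two underlying dimensions
L_true = np.array([[0.9, 0.0], [0.8, 0.1], [0.7, 0.0],
                   [0.0, 0.9], [0.1, 0.8], [0.0, 0.7]])
X = latent @ L_true.T + 0.3 * rng.normal(size=(500, 6))  # six observed variables

R = np.corrcoef(X, rowvar=False)                      # 6 x 6 correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)                  # eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]
k = 2                                                 # retain two factors
loadings = eigvecs[:, order[:k]] * np.sqrt(eigvals[order[:k]])
print(loadings.shape)   # (6, 2): six variables summarized by two factors
```

The loadings matrix shows how strongly each observed variable relates to each retained factor, which is the sense in which the factors can partially replace the original variables.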
This paper explores the use of EFA as a variable-reduction multivariate technique and
assesses its results on different data formats.
The goal of this paper is to discuss common practice in studies using EFA and to provide
practical information on best practices in its use. In particular, we discuss three issues:
(1) to determine which type of data format is suitable for the EFA technique;
(2) to evaluate which type of factor rotation is the most appropriate for EFA;
(3) to assess the results of split-sampling in EFA.
1.1 Preliminaries
A hypothesis is a tentative theory which aims to explain facts about the real world. In
statistics, a statistical hypothesis is a conjecture about a population parameter; this conjecture
may or may not be true. A parameter is a characteristic or measure obtained by using all the
data values of a specific population. Statistical hypotheses come in two forms: the null
hypothesis and the alternative hypothesis. For instance, when the statement indicates there is no
difference between two parameters, the hypothesis is a null hypothesis, denoted by H_0; but when
it states there is a difference between two parameters, the hypothesis is an alternative hypothesis,
denoted by H_1 [8].
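As a small worked illustration of stating H_0 and H_1, the sketch below tests a hypothesized population mean against a sample by computing a one-sample t statistic by hand. The sample values and the hypothesized mean mu0 are illustrative assumptions only.

```python
# Sketch: stating H0/H1 and computing a one-sample t statistic
# (sample and mu0 are made-up illustrative values).
import math

sample = [5.1, 4.9, 5.3, 5.0, 5.2, 4.8, 5.4, 5.1]
mu0 = 5.0   # H0: population mean equals 5.0; H1: population mean differs from 5.0

n = len(sample)
mean = sum(sample) / n
s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
t = (mean - mu0) / math.sqrt(s2 / n)                  # t statistic with n - 1 df
print(round(t, 3))   # 1.414
```

Comparing t against a critical value from the t distribution with n - 1 degrees of freedom then decides whether to reject H_0.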
Let n be the number of observations for two variables x and y. As defined in [7], the
correlation coefficient, denoted by r, is given by

r = \frac{n\sum xy - (\sum x)(\sum y)}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}},

and is a single number that describes the strength of the relationship between x and y. A zero
correlation indicates no relationship between x and y. When x and y have a positive correlation,
x and y move in the same direction, i.e., as x increases, y also increases, and vice versa. On the
other hand, when x and y have a negative correlation, x and y move in opposite directions; that
is, when x increases, y decreases, and vice versa. In addition, a partial correlation is the
relationship between two variables when the effects of one or more related variables are
removed. A correlation matrix is a symmetric matrix showing the intercorrelations among all
variables. Its diagonal has a uniform correlation value of 1.000, which is the correlation of each
variable with itself. The number of correlations m in a correlation matrix is given by

m = \frac{n(n - 1)}{2},

where n is the number of variables.
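The two formulas above can be computed directly. The sketch below evaluates r on a small made-up data set (any two numeric variables of equal length would do) and counts the correlations for 26 variables, the case treated in the next section.

```python
# Computing r and m from the formulas above (x and y are illustrative data).
import math

x = [2.0, 4.0, 5.0, 7.0, 8.0]
y = [1.0, 3.0, 4.0, 6.0, 9.0]
n = len(x)

sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# r = [n*Sxy - Sx*Sy] / sqrt([n*Sx2 - Sx^2] * [n*Sy2 - Sy^2])
r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))

# m = n(n-1)/2 distinct correlations among 26 variables
m = 26 * 25 // 2
print(round(r, 4), m)   # m is 325, matching the correlation matrix of Table 1
```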
To exemplify the concepts of correlation and partial correlation, consider the 26 variables
of the prelim examination results in Table 1 and Table 2 of Appendix B. Table 1 is the
correlation matrix of the 26 variables, which contains 325 correlations, and its diagonal holds a
uniform correlation value of 1.00, the correlation of each variable with itself. Moreover,
variables X_16 (adding unlike terms) and X_18 (incorrect application of DPMA) have a
correlation value of 0.96, which implies that X_16 and X_18 are highly and positively correlated
(i.e., X_16 and X_18 move in the same direction: when X_16 increases, X_18 also increases, and
when X_16 decreases, X_18 also decreases). On the other hand, Table 2 presents the partial
correlations of the variables, that is, the correlation of each pair of variables after the effects of
the remaining variables are removed.
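A partial correlation of the kind shown in Table 2 can be obtained by correlating regression residuals. The sketch below, on synthetic data with illustrative variable names, removes the linear effect of a control variable z from both x and y and then correlates what remains.

```python
# Sketch: first-order partial correlation of x and y controlling for z
# (synthetic data; x, y, z are illustrative, not the Appendix B variables).
import numpy as np

rng = np.random.default_rng(1)
z = rng.normal(size=300)
x = 0.8 * z + rng.normal(scale=0.5, size=300)   # x depends on z
y = 0.8 * z + rng.normal(scale=0.5, size=300)   # y depends on z

def residuals(v, z):
    # Residuals of v after regressing it on z (with an intercept).
    A = np.column_stack([np.ones_like(z), z])
    beta, *_ = np.linalg.lstsq(A, v, rcond=None)
    return v - A @ beta

r_xy = np.corrcoef(x, y)[0, 1]                                # raw correlation
r_xy_z = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]  # partial correlation
print(r_xy, r_xy_z)   # the partial correlation is much closer to zero
```

Here the raw correlation between x and y is inflated because both depend on z; once z is controlled for, little association remains, which is exactly the contrast between Table 1 and Table 2.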
When a certain study has been conducted, one of the goals of the researcher is to describe
the characteristics of the data set. A first attempt is usually made with the descriptive measures
of the data, that is, the measures of central tendency, which serve to locate the center of the data
set, and the measures of dispersion.
Let x_1, x_2, ..., x_n be the n observations. The measures of central tendency include the
mean, median, and mode. The median of n observations, according to [11], can be defined as the
middlemost value once the data are arranged according to size. More precisely, if n is an odd
number, the median is the value of the observation numbered (n + 1)/2; if n is an even number,
the median is the mean of the observations numbered n/2 and n/2 + 1.
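The median rule can be sketched directly, handling both the odd and even cases (the two small lists are illustrative examples).

```python
# The median rule above, for odd and even n.
def median(data):
    s = sorted(data)                        # arrange according to size
    n = len(s)
    if n % 2 == 1:
        return s[(n + 1) // 2 - 1]          # observation numbered (n + 1)/2
    return (s[n // 2 - 1] + s[n // 2]) / 2  # mean of observations n/2 and n/2 + 1

print(median([7, 1, 5]))       # odd n: 5
print(median([7, 1, 5, 3]))    # even n: (3 + 5) / 2 = 4.0
```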
On the other hand, one measure of the dispersion of the data is the sample variance,
denoted by s^2, which is given by

s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1},

where \bar{x} is the mean of the n observations.
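The sample variance formula, with its n - 1 denominator, can be sketched as follows (the data list is an illustrative example).

```python
# Sample variance s^2 with the n - 1 denominator, as in the formula above.
def sample_variance(data):
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)

print(sample_variance([2, 4, 4, 4, 5, 5, 7, 9]))   # 32/7, about 4.571
```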