BLOCK 1 Foundations of Research Methodology in Economics
BLOCK 2 Quantitative Methods: Data Collection
BLOCK 3 Quantitative Methods: Data Analysis
BLOCK 4 Qualitative Methods
BLOCK 5 Database of Indian Economy
BLOCK 6 Use of SPSS and EVIEWS Packages for Analysis and Presentation of Data
EXPERT COMMITTEE
Prof. Alakh N. Sharma, Director, Institute of Human Development, New Delhi – 110 002.
Prof. Gopinath Pradhan, Professor of Economics, IGNOU, New Delhi
Prof. D. Narsimha Reddy, Retd. Professor of Economics, University of Hyderabad, Hyderabad
Sh. S.S. Suryanarayanan, Ex. Joint Advisor, Planning Commission, New Delhi
Prof. Narayan Prasad, Professor of Economics, IGNOU, New Delhi
August 2009.
Indira Gandhi National Open University
All rights reserved. No part of this work may be reproduced in any form, by mimeograph or any other means, without permission in
writing from the Indira Gandhi National Open University.
Further information about the Indira Gandhi National Open University courses may be obtained from the University’s Office at
Maidan Garhi, New Delhi-110 068.
In order to pursue a research degree programme, you need to be equipped with the various
constituents of Research Methodology and the different techniques applied in data
collection and analysis. The present course aims to cater to this need. The theoretical
perspectives that guide research, the tools and techniques of data collection, and the
methods of data analysis together constitute research methodology. This course deals
with all these aspects and comprises six Blocks.
Structure
1.0 Objectives
1.1 Introduction
1.2 An Overview of the Block
1.3 Approaches to Research Methodology: Scientific, Historical and Institutional
1.4 Philosophical Foundations
1.4.1 Positivism
1.4.1.1 Central Tenets of Positivist Philosophy
1.4.1.2 Criticism of Positivism
1.4.2 Post-Positivism
1.4.3 Karl Popper and Critical Rationalism
1.4.4 Thomas Kuhn and Growth of Knowledge
1.4.5 I. Lakatos: The Methodology of Scientific Research Programmes
1.4.6 Paul Feyerabend: Methodological Dadaism and Anarchism
1.5 Models of Scientific Explanation
1.5.1 Hypothetico-Deductive (H-D) Model
1.5.2 Deductive Nomological (D-N) Model
1.5.3 Inductive- Probabilistic (I-P) Model
1.6 Debates on Models of Explanation in Economics
1.6.1 Classical Political Economy and Ricardo’s Method
1.6.2 Hutchison and Logical Empiricism
1.6.3 Milton Friedman and Instrumentalism
1.6.4 Paul Samuelson and Operationalism
1.6.5 Amartya Sen: Heterogeneity of Explanation in Economics
1.7 Research Problem and Research Design
1.7.1 Research Problem
1.7.2 Basic Steps in Framing a Research Proposal
1.8 Further Suggested Readings
1.9 Model Questions
1.0 Objectives
1.1 INTRODUCTION
The block is ambitiously designed to cover the entire breadth of both the main trends of
development in the philosophy of science and the main debates in the methodology of
economics. The first part, beginning with the historical background of positivism, covers
developments up to the contemporary trends in the philosophy of science. The second
part is devoted to the main debates in mainstream economic methodology from the
classical to the contemporary period. Since the focus in this block is substantially on the
scientific method, the emphasis is primarily on the methodology of mainstream
neo-classical economics.
Each section in the block, besides an outline of the subject matter, provides a detailed
reading guide with some critical reading material being included in boxes, wherever it is
felt necessary to draw your attention to a specific reading.
1.4.1 Positivism
The basic tenets of positivism had a relatively long history of evolution over the first half
of the 20th century and went through different phases of development. A question often
asked is whether it is ‘Positivism’ or ‘Positivisms’. Positivism in its evolved form is
often called ‘Logical Positivism’ or ‘Logical Empiricism’. Bruce Caldwell (1982), in
the first chapter of his book, attempts to reconstruct the development and basic aspects of
Logical Empiricism. But for learners like you, intending to know the Methodology of
Science, it would be helpful to familiarize yourself with I. Naletov’s first chapter (esp.
pp. 23-58), which distinguishes three phases in the development of Positivism and
identifies the third phase, of the 1940s and 1950s, as Logical Positivism. Along with
this, you should read Kolakowski’s ‘Rules of Positivism’, which are reproduced in the
first chapter of C.G.A. Bryant’s Positivism in Social Theory and Research (1985).
K3 The Rule of Value-free Statements: There is no place for value judgments and
normative statements in science. Science is concerned with ‘what is’ and not
‘what ought to be’.
First, the positivist rule of beginning all scientific investigation with facts and facts alone
was questioned by Popper. Knowledge does not start from nothing – a tabula rasa.
“Before we can collect data, our interest in data of a certain kind must be
aroused; the problem always comes first”. There are no brute facts; all facts are
theory-laden. Before one collects facts or observes things, one should have relevant
questions, which arise from existing knowledge.
The second major criticism of Positivism relates to the problem of induction and the
associated test of verification. Popper gave the famous example of the colour of swans.
Following Positivism, if one repeatedly observes, without any change and in a number of
places, that the colour of swans is white, one would universalize that ‘all swans are
white’. But Popper draws attention to the problem that any number of observations of
white swans in different locations does not mean that all swans are white: there could be
one, still to be observed, which turns out to be other than white. This is the typical
problem of induction. One cannot verify the truth of a universal statement by any number
of observations, but one non-white swan can falsify it. A universal theory can be shown
to be false but never proven to be true. Verification is therefore not the right test for
universalization, and Popper insisted on the test of falsification to overcome the problem
of induction.
Third, contrary to the Positivist insistence on a universal method, the theoretical idiom of
different sciences varies. Unless there is a specific theory relevant to the subject, facts
cannot be expressed in recognizable form; otherwise, there will be mere description of
instruments and activities rather than a pursuit of the underlying knowledge. Pierre
Duhem’s famous example is that one watching an oscillating iron bar with a mirror
attached sees only objects and facts, not a measurement, if one has no theory of
measuring electrical resistance.
These criticisms against Positivism were mainly aimed at the empirical extremes
claiming that observations are theory-independent.
1.4.2 Post-Positivism
As we have seen above, by the 1960s the limitations of Positivism were widely criticized.
Much of the criticism took the form of exploring alternative approaches to scientific
methodology, which emerged as the Post-Positivist philosophy of science. Post-
Positivism consists of several approaches, of which Karl Popper’s ‘Critical Rationalism’,
Thomas Kuhn’s ‘Growth of Knowledge’, Imre Lakatos’s ‘Scientific Research
Programmes’ and Paul Feyerabend’s criticism ‘Against Method’ are the major contributions.
We shall discuss each of these Post-Positivist approaches, their basic contents and the
helpful reading material.
Karl Popper, as we have seen above, was one of the earliest critics of Positivism and,
over a period, his writings have emerged as an alternative to Positivism. His approach to
the Philosophy of Science is known as ‘Critical Rationalism’. Since his writings evolved
over a period of time, one has to be careful in choosing the reading material. His
contributions may be broadly grouped into (i) Criticism of Positivism, (ii) Basis of
Knowledge, (iii) Problem of Induction and (iv) Methodology of Falsificationism.
Growth of Knowledge
Since we are familiar with Popper’s criticism of Positivism, we shall turn to the basics of
his ‘Critical Rationalism’. According to him, the basis for the growth of knowledge is the
existence of a ‘critical spirit in society’. He conceives of three autonomous worlds. The
first world he terms ‘physical reality’. The second is ‘subjective knowledge’, referring to
consciousness. It is the third world, which he calls ‘objective knowledge’, that is the
domain of the pursuit of science through ‘theories, problems and arguments’.
Fallibilism
All existing knowledge is fallible and full of errors. Objective truth exists and there are
ways to recognize it. The advance of knowledge consists merely in the modification of
earlier knowledge; real progress of knowledge involves the elimination of errors.
Some of the main criticisms against Popper’s ‘falsificationism’ include the following:
One, in the pursuit of scientific knowledge, logical falsifiability turns out to play only a
minute role in the actual process of theory rejection or revision.
Third, since individual scientific theories need not be falsifiable, there is no logical
asymmetry between verifiability and falsifiability of particular scientific theories. They
are neither verifiable nor falsifiable.
You may begin with Blaug (1980 pp. 10-17) to understand Popper’s criticism of
Positivism on the counts of verification, and alternative suggestion on the criteria of
demarcation of science from non-science through ‘falsification’ as an approach to
overcome the problems of induction.
O’Hear’s (1980) book would help you to appreciate the context of his writings on the
philosophy of science. Popper’s The Logic of Scientific Discovery would help you once
you are familiar with an overview of his contributions. Naletov’s (1984) chapter on
Popper is especially useful for understanding his ‘Third World’ of knowledge. The best
source not only for a critical appraisal but also for an in-depth discussion of Popper’s
methodology is Hausman (1988).
Read: Chapter 1, pp. 10-17, Mark Blaug (1980) The Methodology of Economics,
Cambridge University Press, Cambridge.

Daniel H. Hausman (1988) “An Appraisal of Popperian Methodology” in Neil De
Marchi (ed), The Popperian Legacy in Economics, Cambridge University Press,
Cambridge, pp. 65-76.

1.4.4 Thomas Kuhn and Growth of Knowledge
In the course of the practice of ‘normal science’ within a paradigm, anomalies arise from
unsolved puzzles. As the anomalies increase, a crisis arises which puts the paradigm on
trial. With mounting puzzles there would be a paradigm-change, which brings in new
ways of analysis, new approaches and new knowledge. The paradigm-change is like a
‘gestalt’ switch – a total change of vision – and marks a revolutionary break from the
past, something like the shift from the notion of a flat earth to that of a round earth.
Kuhn’s second edition (1970) is a lucid piece of writing and you should take it as the
basic reading (see Box). The first few pages give a detailed overview of the contents and
will be of immense help in following the rest of the text. Blaug (1980, pp. 29-34)
contains a good summary version, and its first part is a good introduction to both Kuhn
and Lakatos. The first two sections of Aidan Foster-Carter (1976) are an excellent
summary of Kuhn’s ‘structure of scientific revolutions’.
Imre Lakatos also belongs to the contemporary philosophy of science and, like Kuhn,
believes in the history of science as a guide to explaining the methodology of science.
His approach is called the ‘Methodology of Scientific Research Programmes’ (MSRP).
He tries to bridge the contributions of Popper and Kuhn: MSRP is considered a
compromise between the ahistorical, aggressive, rule-bound methodology of Popper on
the one hand and the relativistic, defensive methodology of Kuhn on the other.
According to MSRP, validation in science involves not individual theories but clusters of
interconnected theories, which may be called scientific research programmes (SRPs).
SRPs are not scientific once and for all; an SRP may experience ‘progressive’ or
‘degenerating’ phases, and these phases have ‘theoretical’ and ‘empirical’ components.
If successive theories in a programme predict excess empirical content, and some of that
excess content is corroborated, the problem-shift is ‘progressive’. An SRP may
experience a problem-shift from a ‘degenerating’ to a ‘progressive’ phase, as in
psychology. If a theory does not lead to much empirical content, there will be a
‘degenerating’ problem-shift, as in astrology. There has been extensive use of Lakatos’s
MSRP in theory appraisal, both in the sciences and in social sciences like Economics.
Lakatos is a difficult writer, and yet Lakatos and Musgrave (1970) in parts would be
useful reading. Lakatos’s paper in the volume (it is lengthy and difficult), particularly
pp. 132-138, serves as a good introduction. Blaug’s (1980) summary (pp. 34-41) is
helpful. Caldwell (1982) contains a brief summary on Lakatos. For those interested in
the application of Kuhn and Lakatos to theory appraisal in Economics, Latsis (1976) is an
important source.

Read: Imre Lakatos (1970) “Falsification and the Methodology of Scientific Research
Programmes” in Imre Lakatos and A. Musgrave (ed) Criticism and the Growth of
Knowledge, Cambridge University Press, Cambridge, pp. 132-194 (esp. pp. 132-138).
His two important works are Against Method (1975) and Science in a Free Society
(1978). While the former cautions against the pitfalls of a rigid method and pleads for
breaking rules, the latter emphasizes the limitations of all methodologies and highlights
the role of humility, tenacity, interactiveness and plurality. Caldwell (1982) has a very
useful summary of Feyerabend’s contribution (pp. 79-85). Naletov (1984) provides a
good summary account of Against Method. But one caution: do not stop with reading
Blaug (1980, pp. 40-44) on Feyerabend. Blaug gives the impression that Feyerabend is a
non-serious and flippant ‘methodologist’. This is only a caricature; the truth is that
Feyerabend needs careful attention.
Explanation assumes a central place in the pursuit of scientific knowledge. The search
for making the testability criterion concrete has been a major problem in the philosophy
of science. In principle, at least until Popper raised serious questions, complete
verification of observational evidence was considered meaningful. But strict verifiability
was always problematic, and a solution was sought in terms of the confirmation of some
of the experimental propositions. Further developments in this direction resulted in rules
of correspondence between theoretical terms and observation terms. Out of this emerged
an explanatory system called the Hypothetico-Deductive Model (H-D Model). On this
view, scientific theories have three components: an ‘abstract calculus’, a set of rules that
assign empirical content to the ‘abstract calculus’, and a model for the abstract calculus.
The H-D Model explicitly addresses the problems of a theory’s structure. By relaxing
the strict Positivist correspondence principle between science and observable phenomena,
the H-D Model allowed a substantial role for theories and theoretical terms. But theories
in these models continued to be treated as eliminative fictions, and establishing
correlations among phenomena was considered all that science could and should do. In
fact, early positivists considered that explanations had no role in science.
This counter-intuitive approach to scientific explanation was eventually replaced by the
contribution of Hempel and Oppenheim, who developed the Deductive-Nomological
(D-N) Model, or what are called ‘Covering Law Models’. However, it was realized that
many explanations in science, because they make use of statistical laws, cannot be
adequately accounted for by the D-N Model. Hempel later developed the ‘inductive-
probabilistic’ (I-P) model, in which explanations consist of statistical laws. Covering
Law Models too came in for criticism, specifically for the ‘symmetry thesis’ – the
claimed symmetry between explanation and prediction – and for the claim that they
adequately explain almost all legitimate phenomena in the natural and social sciences.
The basic developments leading to the development of H-D Model and its limitations are
very well summarized in Caldwell (1982, pp. 23-32). This also provides an excellent
summary of Carl Hempel’s emphasis on many positive functions of theories. For a more
detailed discussion you may go through Hempel's collected essays (1965).
Read: Caldwell, B (1982) Beyond Positivism …, George Allen & Unwin, London.
The D-N Model is perhaps the most tenacious of all models of explanation, having
survived well after the decline of Positivism. There is a brief summary of the D-N Model
in Hausman (1984, pp. 6-10) and also in Caldwell (1982, pp. 28-32 and pp. 54-63).
Blaug (1980, pp. 2-9) provides a summary critique of Covering Law Models. But there
is no substitute for the Hempel and Oppenheim paper in Brody (1970).
Carl G. Hempel and Paul Oppenheim (1948) “Studies in the Logic of Explanation”
(pp. 9-20) in Baruch Brody (ed) Readings in the Philosophy of Science,
Prentice Hall, Englewood Cliffs, New Jersey, 1970.
This is an extension of the D-N Model by Hempel for application where statistical laws
are involved. The basic reading would involve the sections referred to above in Caldwell
(1982), Blaug (1980) and Hempel’s collected papers (1965).
As Daniel Hausman observes, ever since its inception in the eighteenth century, the
science of economics has been methodologically controversial. There has always been
the haunting question of whether economics is a science at all. Beginning with the early
1980s, there has been a resurgence of interest in philosophical and methodological
questions concerning economics. When serious doubts are expressed about its scientific
credibility, economists appear to turn to methodological reflection in the hope of finding
some flaw in previous economic study, or a new methodological directive that will better
guide their work in the future.
We intend to trace the origins of methodological interest in political economy and the
desire to model economics as a science by adopting the methods of science. In the
process, it is hoped, you will be in a position to see the methodological concerns from
classical economics to the present times. It will help you to understand the consequences
of the obsession with adopting the methods of the natural sciences in a complex social
science like economics. In the end, you will be in a position to appreciate the limitations
of the present mainstream methodological approach, which, in spite of the decline of
Positivism as a methodology of the natural sciences, remains under overwhelming
Positivist influence.
Let us begin with the methodological position of classical political economy, starting
with David Ricardo’s method. Though Ricardo did not himself write explicitly on
methodology, his writings carried the seal of the abstract deductive method that was dealt
with at length by his followers such as N.W. Senior, J.S. Mill, J.E. Cairnes and J.N.
Keynes. This is followed by the Neo-Classical School, especially Lionel Robbins, and
the controversy it generated, which acted as a turning point in economic methodology.
Thereafter, we shall turn to the methodological contributions of prominent contemporary
mainstream economists, including Milton Friedman and Paul Samuelson. The last
section refers to Amartya Sen’s contribution on the heterogeneity of explanation in
economics.
Mill’s views on the nature of the methodology of economics were influential throughout
the nineteenth and even into the early twentieth century. J.N. Keynes and J.E. Cairnes
carried forward Mill’s methodological views. All this was a period when economics
asserted scientific status on the basis of abstract deductive explanation, without any
appeal to testing or verification.
Blaug (1980) is very useful, but you have to ignore the chapter title “Verificationists”:
there are no verificationists as such; only Senior, Mill, J.N. Keynes and Cairnes are
discussed. There is an excellent discussion of J.S. Mill in Hausman (1981). J.S. Mill’s
essay is reproduced in Hausman (1984), and it is an insightful essay for realizing, even at
present, why economics turns out to be an ‘inexact science’. T. Hutchison (1988) also
provides an overview of the early methodological approaches in economics.
Lionel Robbins’ An Essay on the Nature and Significance of Economic Science (1935) is
a path-breaking methodological contribution to economics that held sway for a
substantial part of the first half of the twentieth century and even today serves as the work
that defines the nature of the subject matter of economics. The major objective of the
essay was to rid economics of ethics and normative welfare considerations and to
approach it as a ‘pure science’. His contention is that economics is a pure theory based
on prior experience and hence needs no testing. He therefore conceives of economics as
an ‘a priori science’, and his approach has been described as ‘apriorism’. At the same
time, he claimed the status of positive science for economics because of his insistence on
ridding economic analysis of all normative considerations. His claims of a positive
science of pure theory, without any need for testing, were subjected to extensive
criticism, and Hutchison was foremost among the critics. In fact, the latter’s criticism
severely undermined Robbins’ scientific claims for economics.
Read: Lionel Robbins (1935) An Essay on the Nature and Significance of Economic Science,
reproduced in Daniel Hausman (1984) The Philosophy of Economics: An Anthology,
Cambridge University Press, Chapter 3, pp. 83-110.
B. Caldwell (1982) Beyond Positivism …., George Allen & Unwin, Part II Chapter 6,
“Robbins Vs Hutchison”, pp. 99-128.
subjected to testing, and this earned him the description ‘radical empiricist’. It also led
to a debate on the testability of assumptions in economics. Aided by other developments
in improved sources and methods of data collection, it certainly turned economics more
towards empirical research.
Of all the methodological debates in modern economics, the one that revolves round
Milton Friedman’s contribution has the most far-reaching significance, because it not
only tries to establish formal foundations for research in economics but also brings into
wide view the limitations and difficulties involved in putting a scientific façade on
economics. His methodology is known as ‘instrumentalism’, since he considers that the
function of assumptions and theory is to yield predictions, nothing more.
senses in which ‘unrealism’ is used, one may begin with Nagel in Caldwell (1984). For
an elaboration on this, see Boland (1979). Melitz (1965) provides a good summary of the
reasons advanced by Friedman as to why the search for realistic assumptions is futile.
Besides, Melitz also helps us to locate Friedman in the appropriate historical context in
the evolution of economic methodology. For critical remarks, and specifically for the
characterization of Friedmanian ‘instrumentalism’ as an ‘F-Twist’, Samuelson in
Caldwell (1984) is very useful. Mason (1980) is a drastic criticism of Friedman, whose
work he calls “a mythology resulting in methodology”. This critique is with particular
reference to monetary theory.
There is a good summary of the whole debate in Blaug (1980). There is a good
discussion of Friedman’s method, along with other shades of empiricism, in Eugene
Rotwein (1980). For a short but stimulating contrast of the positivist ‘predictive’
approach of Friedman with the anti-positivist ‘assumptive’ approach of F. Knight, see
Abraham and Eva Hirsh (1980); they trace the Friedmanian approach to Senior and
Cairnes. For an excellent analysis of the Chicago School, with Friedman at the centre,
tracing the origins of logical positivism and the insularity of positivism giving rise to a
kind of “ideal type”, see C.K. Wilber and Jon D. Wisman (1980).
Bruce Caldwell (1982) provides a brief but succinct summary of Friedman’s essay,
Boland’s restatement and the philosophical rejection of Instrumentalism. Rotwein (1959)
contains a good summary of Friedman’s methodology, followed by a critical appraisal.
Melitz (1965) gives an account of the debate on the realism of assumptions and the
significance of testing assumptions. Boland (1979) provides a valiant defence of
Friedman by attempting to answer every point of criticism.
Read: Bruce Caldwell (1982) Beyond Positivism …, George Allen & Unwin,
Chapter 9, pp. 189-200.
Once there is basic clarity on the research problem to be pursued, you may get down to
preparing a research proposal. There are certain basic steps involved in preparing a
research proposal, and this preparation will facilitate smooth sailing in carrying out the
research work. The following are the rudimentary steps in preparing a research design.
1. Introduction
“The introduction is the part of the paper that provides readers with the background
information for the research reported in the paper. Its purpose is to establish a framework
for the research, so that readers can understand how it is related to other research”.
Effective problem statements answer the question “Why does this research need to be
conducted?” If a researcher is unable to answer this question clearly and succinctly, and
without resorting to hyperspeaking (i.e. focusing on problems of macro or global
proportions that certainly will not be informed or alleviated by the study), then the
statement of the problem will come off as ambiguous and diffuse.
Objectives should be stated clearly, and they should be kept in view throughout the
investigation and analysis. One of the important characteristics of a good piece of
research work is that the findings are sharply linked to the set objectives. Since the
objectives provide direction to the entire research work, they should be limited and
focused; too many objectives are likely to be a hindrance to analysis and interpretation.
4. Review of Literature
The review of literature is meant to gain insight into the topic and knowledge of the
availability of data and other materials in the proposed area of research. The literature
reviewed may be classified into two types, viz. (i) literature relating to concepts and
theory and (ii) empirical literature consisting of findings, in quantitative terms, from
studies conducted in the area. This will help in framing the research questions to be
investigated. Academic journals, conference proceedings, government reports, books etc.
are the main sources of literature. With the spread of IT, one can access a large volume
of literature through the internet.
5. Questions or Hypotheses
Questions are relevant to normative or census-type research (How many of them are
there? Is there a relationship between them?). They are most often used in qualitative
inquiry, although their use in quantitative inquiry is becoming more prominent.
Hypotheses are relevant to theoretical research and are typically used only in quantitative
inquiry. When a writer states hypotheses, the reader is entitled to an exposition of the
theory that led to them (and of the assumptions underlying the theory). Just as
conclusions must be grounded in the data, hypotheses must be grounded in the theoretical
framework.
The methods or procedures section is really the heart of the research proposal. The
activities should be described with as much detail as possible, and the continuity between
them should be apparent. There is a need to indicate the methodological steps to be taken
to answer every question or to test every hypothesis.
The issues relating to sources of data, nature of data, sampling design, methods of
collection of data, methods of analysis etc. all should be clearly discussed in this section.
There is always a question whether the proposed research leads to ‘value addition’ – in
this case addition to knowledge in the domain of the proposed research. It would be
important to indicate how the proposed research will refine, revise or extend existing
knowledge in the area of investigation.
9. References
Proper documentation is an essential part of any research work. There are a number of
style sheets which would be of help for proper referencing in the text and in the reference
list.
Abraham and Eva Hirsh (1980) “The Heterodox Methodology of Two Chicago
Economists” in W.J. Samuels (ed.).
Bruce Caldwell (1984) Appraisal and Criticism in Economics, Allen and Unwin, Boston.
Carl Hempel (1965) Aspects of Scientific Explanation and Other Essays in the
Philosophy of Science, Free Press, New York.
C.K. Wilber and Jon D. Wisman, (1980) “The Chicago School: Positivism or Ideal Type”
in W.J. Samuels (ed.).
Duncan Hodge (2008) “Economics, Realism and Reality: A Comparison of Maki and
Lawson”, Cambridge Journal of Economics, Vol. 32, No. 2, March 2008, Oxford
University Press.
L. Boland (1979) “A Critique of Friedman’s Critics” JEL, June 1979, also in Caldwell
(1984).
Neil de Marchi (ed) (1988) The Popperian Legacy in Economics, Cambridge University
Press, Cambridge.
Spiro Latsis (ed) (1976) Method and Appraisal in Economics, Cambridge University
Press, Cambridge.
Stanley Wong (1973) “The F-Twist and the Methodology of Paul Samuelson”, American
Economic Review, June 1973.
W.J. Samuels (ed.) (1980) The Methodology of Economic Thought, Transaction Books,
New Brunswick.
1. Why does the question ‘Positivism’ or ‘Positivisms’ arise? How does one get round
this problem?
2. According to Popper, what are the limitations of Positivism? Critically examine
Popper’s philosophy of science.
3. How does Kuhn explain growth of knowledge?
4. Why is Lakatos’ contribution called a bridge between Popper and Kuhn?
5. What is the significance of Feyerabend’s tirade ‘against method’?
6. Discuss the evolution of explanatory structures within Positivism.
7. Discuss the contribution of Hempel and Oppenheim to explanatory models.
8. Critically examine the methodological contention of the Classical School.
9. What is apriorism? How does Robbins defend it?
10. Evaluate the methodological contribution of Hutchison.
11. How does one explain the all-pervasive appreciation as well as criticism of Milton
Friedman’s methodology of positive economics?
12. Is there room for methodological heterodoxy in economics? What is the
significance of Sen’s methodological contribution?
Structure
2.1 Introduction
2.2 Objectives
2.3 An Overview of the Block
2.4 Method of Data Collection
2.5 Tools of Data Collection
2.6 Sampling Design
2.6.1 Population and Sample Aggregates and Inference
2.6.2 Non-Random Sampling
2.6.3 Random or Probability Sampling
2.6.4 Methods of Random Sampling
2.6.4.1 Simple Random Sampling with Replacement (SRSWR)
2.6.4.2 Simple Random Sampling without Replacement (SRSWOR)
2.6.4.3 Interpenetrating Sub-Samples (I-PSS)
2.6.4.4 Systematic Sampling
2.6.4.5 Sampling with Probability Proportional to Size (PPS)
2.6.4.6 Stratified Sampling
2.6.4.7 Cluster Sampling
2.6.4.8 Multi-Stage Sampling
2.7 The Choice of an Appropriate Sampling Method
2.8 Let Us Sum Up
2.9 Further Suggested Readings
2.10 Some Useful Books
2.11 Model Questions
2.1 INTRODUCTION
Research is the objective and systematic search for knowledge to enhance our
understanding of the complex physical, social and economic phenomena that surround us.
It involves a scientific study of the variety of factors or variables that shape such
phenomena, the interrelationships amongst them and how these impact on our lives. The
results of such studies give rise to more questions for us to find answers and egg us on to
further research, resulting in the extension of the frontiers of knowledge.
Research findings, which often provide the theoretical basis for policy or a review of policy, call for objectivity, integrity and
analytical rigour in order to ensure academic and professional acceptability and, above
all, to serve as an effective tool to tackle the problem at hand. Data used for research should,
therefore, reflect, as accurately as possible, the phenomena they seek to measure and be
free from errors, bias and subjectivity. Collection of data has thus to be made on a
scientific basis.
2.2 OBJECTIVES
How do we assemble data on a scientific basis? There are broadly three different methods of
collecting data. These are dealt with in Section 2.4. The tools that one can use for
collecting data – the formats and the devices that modern technology has provided – are
enumerated in section 2.5. There are situations where it is considered desirable to gather
data from only a part of the universe, or a sample selected from the universe of interest to
the study at hand, rather than a complete coverage of the universe, for reasons of cost,
convenience, expediency, speed and effort. Questions then arise as to the manner in
which such a sample should be chosen – the sampling design. This question is examined
in detail in Section 2.6. The discussion is divided into a number of sub-topics. Concepts
relating to population aggregates like mean and variance and similar aggregates from the
sample and the use of the latter as estimates of population aggregates have been
introduced in sub-section 2.6.1. There are two types of sampling: random and non-random. Non-random sampling methods and the contexts in which these are used are
described in sub-section 2.6.2. A random sample has certain advantages over a non-random sample: it provides a basis for drawing valid conclusions from the sample about
the parent population. It enables us to state the precision of the estimates of population
parameters in terms of (a) the extent of their variation or (b) an interval within which the
value of the population parameter is likely to lie with a given degree of certainty. Further,
it even helps the researcher to determine the size of the sample to be drawn if his project
is subject to a sanctioned budget and permissible limits of error in the estimate of the
population parameter. These principles are explained in sub-section 2.6.3. Eight methods
of random sampling are then detailed in sub-section 2.6.4. These details relate to (i)
operational procedures for drawing samples, and (ii) expressions for (a) estimators of
parameters and measures of their variation and (b) estimators of such variation where the
population variation parameter is not known. Different sampling procedures are also
compared, as we go along, in terms of the relative precision of the estimates they
generate. Finally, the question of choosing the sampling method that is appropriate to a
given research context is addressed in Section 2.7. A short summing up of the Block is
given in Section 2.8.
Each Section/subsection ends with a box guiding you to relevant portions of one or more
publications that give you more details on the topic(s) handled in it. Fuller details of these
publications are indicated in Section 2.10. Section 2.9 is meant to kindle your interest and
appetite for recent developments in the subject and Section 2.11 for evaluation of your
knowledge of the subject matter covered in this Block.
2.4 METHOD OF DATA COLLECTION
There are three methods of data collection – the Census and Survey Method, the
Observation Method and the Experimental Method. The first is a carefully planned and
organised study or enquiry to collect data on the subject of the study/enquiry. We might
for instance organise a study on the prevalence of the smoking habit among high school
children – those aged 14 to 17 - in a certain city. One approach is to collect data of the
kind we wish to collect on the subject matter of the study from all such children in all the
schools in the city. In other words, we have a complete enumeration or census of the
population or universe relevant to the enquiry, namely, the city’s high school children
(called the respondent units or informants of the Study) to collect the data we desire. The
other is to confine our attention to a suitably selected part of the population of high
school children of the city, or a sample, for gathering the data needed. We are then
conducting a sample survey. A well known example of Census enquiry is the Census of
Population conducted in the year 2001, where data on the demographic, economic,
social and cultural characteristics of all persons residing in India were collected. Among
sample surveys of note are the household surveys conducted by the National Sample
Survey Organisation (NSSO) of the Government of India that collect data on the socio-
economic characteristics of a sample of households spread across the country.
The Observation Method records data as things occur, making use of an appropriate and
accepted method of measurement. An example is to record the body temperature of a
patient every hour or a patient’s blood pressure, pulse rate, blood sugar levels or the lipid
profile at specified intervals. Other examples are the daily recording of a location’s
maximum and minimum temperatures, rainfall during the South West / North East
monsoon every year in an area, etc.
The Experimental Method collects data through well designed and controlled statistical
experiments. Suppose for example, we wish to know the rate at which manure is to be
applied to crops to maximise yield. This calls for an experiment, in which all variables
other than manure that affect yield, like water, quality of soil, quality of seed, use of
insecticides and so on, need to be controlled so as to evaluate the effect of different levels
of manure on the yield. Other methods of conducting the experiment to achieve the same
objective without controlling “all other factors” also exist. Two branches of statistics -
The Design and Analysis of Experiments and Analysis of Variance - deal with these.
2.5 TOOLS OF DATA COLLECTION
How do we collect data? We translate the data requirements of the proposed Study into
items of information to be collected from the respondent units to be covered by the study
and organise the items into a logical format. Such a format, setting out the items of
information to be collected from the respondent units, is called the questionnaire or
schedule of the study. The questionnaire has a set of pre-specified questions and the
replies to these are recorded either by the respondents themselves or by the investigators.
The questionnaire approach assumes that the respondent is capable of understanding and
answering the questions all by himself/herself, as the investigator is not supposed, in this
approach, to influence the response in any manner by interpreting the terms used in the
questions. Respondent-bias will have to be minimised by keeping the questions simple
and direct. Often the responses are sought in the form of “yes”, “no” or “can’t say” or the
judgment of the respondent with reference to the perceived quality of a service is graded,
like, “good”, “satisfactory” or “unsatisfactory”.
In the schedule approach on the other hand, the questions are detailed. The exact form of
the question to be asked of the respondent is not given to the respondent and the task of
asking and eliciting the information required in the schedule is left to the investigator.
Backed by his training and the instructions given to him, the investigator uses his
ingenuity in explaining the concepts and definitions to respondents to obtain reliable
information. This does not mean that investigator-bias is more in the schedule approach
than in the questionnaire approach. Intensive training of investigators is necessary to
ensure that such a bias does not affect the responses from respondents.
Schedules and questionnaires are used for collecting data in a number of ways. Data may
be collected by personally contacting the respondents of the survey. Interviews can also
be conducted over the telephone and the responses of the respondent recorded by the
investigator. The advent of modern electronic and telecommunications technology
enables interviews to be conducted through e-mail or by ‘chatting’ over the internet. The mail
method is one where (usually) questionnaires are mailed to the respondents of the survey
and replies received by mail through (postage pre-paid) business-reply envelopes. The
respondents can also be asked (usually by radio or television channels or even print
media) to send their replies by SMS to a mobile telephone number or to an e-mail
address.
In some official data systems, returns collected at various field points are forwarded to a
central agency like the Directorate General of Commercial Intelligence and Statistics
(DGCI&S) for consolidation.
The above methods enable us to collect primary data, that is, data collected afresh
by the agency conducting the enquiry or study. The agency concerned can also make
use of data on the subject already collected by another agency or other agencies –
secondary data. Secondary data are published by several agencies, mostly Government
agencies, at regular intervals. These can be collected from the publications / compact
discs or the websites of the agencies concerned. But such data have to be examined
carefully to see whether these are suitable or not for the study at hand before deciding to
collect new data.
Errors in data constitute an important area of concern to data users. Errors can arise from
confining data collection to a sample (sampling errors). They can also be due to faulty
measurement arising out of a lack of clarity about what is to be measured and how it is to be
measured. Even when these are clear, errors can creep in through inaccurate measurement.
Investigator bias also leads to errors in data. Failure to collect data from respondent units
of the population or the sample, whether due to omission by the investigator or due to non-response
(respondents not furnishing the required information), also results in errors
(non-sampling errors). The total survey error, made up of these two types of errors, needs
to be minimised to ensure the quality of data.
2.6 SAMPLING DESIGN
We have looked at methods and tools of data collection, chief among which is the sample
survey. How do we select a sample for the survey to be conducted? There are a number of
methods of choosing a sample from a universe. These fall into two categories, random
sampling and non-random sampling. Let us turn to these methods and see how well the
results from the sample can be utilised to draw conclusions about the parent universe.
1 The sample units are referred to as ui (i = 1,2,…,n) and not in terms of Ui as we do not know
which of the population units have got included in the sample. Each ui in the sample is some population
unit.
2.6.1 Population and Sample Aggregates and Inference
Let the population consist of N units Ui (i = 1,2,…,N) with values Yi. Suppose we draw a sample of ‘n’ units1 from the
population and let the value of the ith sample unit be yi (i = 1,2,…,n)2. In other words, yi
(i = 1,2,…,n) are the sample observations. A function of the sample observations is
referred to as a statistic. The sample mean ‘m’ given by m = (1/n)∑i yi , ∑i (i = 1 to n), is an
example of a statistic.
Let us note the formulae for some important parameters and statistics: the population
mean μ = (1/N)∑i Yi , the population variance σ2 = (1/N)∑i (Yi – μ)2 , the sample mean
m = (1/n)∑i yi and the sample variance s2 = (1/n)∑i (yi – m)2 .
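To make the distinction between parameters and statistics concrete, here is a minimal Python sketch. The population values, seed and sample size below are invented purely for illustration and are not part of the text:

```python
import random

random.seed(42)
# A hypothetical population of N = 1000 values (illustrative only).
population = [random.gauss(50, 10) for _ in range(1000)]
N = len(population)

# Population parameters: mean and variance (divisor N).
mu = sum(Y for Y in population) / N
sigma2 = sum((Y - mu) ** 2 for Y in population) / N

# A random sample of n = 25 units; functions of the sample
# observations, such as the sample mean, are statistics.
n = 25
sample = random.choices(population, k=n)        # drawn with replacement
m = sum(sample) / n                             # sample mean 'm'
s2 = sum((y - m) ** 2 for y in sample) / n      # sample variance (divisor n)

print(mu, sigma2, m, s2)
```

The parameters (mu, sigma2) are fixed for the population; the statistics (m, s2) change from sample to sample.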
2 The same reasons apply for referring to the sample values or observations as yi (i = 1,2,…,n) and not in terms
of the population values Yi . yi will be some Yi .
The estimate ‘m1’ of the population parameter ‘μ’, computed from a sample, will most
likely be different from ‘μ’. There is thus an error in using ‘m1’ as an estimate of ‘μ’.
This error is the sampling error, assuming that all measurement errors, biases etc., are
absent, that is, there are no non-sampling errors. Let us draw another sample from the
population and compute the estimate ‘m2‘ of ‘μ’. ‘m2‘ may be different from ‘m1’ and
also from ‘μ’. Suppose we generate in this manner a number of estimates mi (i =
1,2,3,…) of ‘μ’ by drawing repeated samples from the population. All these mi (i =
1,2,3,….) would be different from each other and from ‘μ’. What is the extent of the
variability in the mi (i = 1,2,3,….), or, the variability of the error in the estimate of ‘μ’
computed from different samples? How will these values be spread or scattered around
the value of ‘μ’ or the errors be scattered around zero? What can we say about the
estimate of the parameter obtained from the specific sample that we have drawn from the
population as a means of measuring the parameter, without actually drawing repeated
samples? How well do non-random and random samples answer these questions? The
answers to these questions are important from the point of view of inference.
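The behaviour of the estimates m1, m2, … under repeated sampling can be seen in a short simulation. The sketch below is in Python, with an invented population used only for illustration:

```python
import random
from statistics import fmean, pstdev

random.seed(0)
# An invented population; its mean is the parameter to be estimated.
population = [random.uniform(0, 100) for _ in range(5000)]
mu = fmean(population)

# Draw many repeated random samples and record each estimate m_i of mu.
estimates = [fmean(random.sample(population, 50)) for _ in range(2000)]

# The m_i differ from each other and from mu, but scatter around mu;
# equivalently, the errors (m_i - mu) scatter around zero.
print(fmean(estimates) - mu, pstdev(estimates))
```

The spread of the estimates (their standard deviation) is the empirical counterpart of the variability of the sampling error discussed above.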
Let us first look at the different methods of non-random sampling and then move on to
random sampling.
Read Sections 3.1 to 3.3, pp. 66 – 77 and Section 3.9, pp. 97 – 107, Chapter 3, Richard I
Levin and David S Rubin (1991).
2.6.2 Non-Random Sampling
There are several kinds of non-random sampling. A judgment sample is a sample that
has been selected by making use of one’s expert knowledge of the population or the
universe under consideration. It can be useful in some circumstances. An auditor for
example could decide, on the basis of his experience, on what kind of transactions of an
institution he would examine so as to draw conclusions about the quality of financial
management of an institution. Convenience Sampling is used in exploratory research to
get a broad idea of the characteristic under investigation. An example is a sample
consisting of some of the people coming out of a movie theatre, who may be asked
to give their opinion of the movie they have just seen. Another example consists
of those passers-by in a shopping mall whom the investigator is able to meet. They may
be asked to give their opinion on a certain television programme. The point here is the
convenience of the researcher in choosing the sample. Purposive Sampling is very
similar to judgement sampling and is also made use of in preliminary research. Such a
sample is one that is made up of a group of people specially picked for a given
purpose. In Quota Sampling, subgroups or strata of the universe (and their shares in the
universe) are identified. A convenience or a judgement sample is then selected from each
stratum. No effort is made in these types of sampling to contact members of the universe
who are difficult to reach. In Heterogeneity Sampling units are chosen to include all
opinions or views. Snowball Sampling is used when dealing with a rare characteristic. In
such cases, contacting respondent units would be difficult and costly. This method relies
on referrals from initial respondents to generate additional respondents. This technique
enables one to access social groups that are relatively invisible and vulnerable. This
method can lower search costs substantially but this saving in cost is at the expense of the
representative character of the sample. An example of this method of sampling is to find
a person with a rare genetic trait and to trace his lineage to understand the origin,
inheritance and etiology of the trait.
It would be evident from the description of the methods given above that the relationship
between the sample and the parent universe is not clear. The selection of specific units for
inclusion in the sample seems to be subjective and discretionary in nature and, therefore,
may well reflect the researcher’s or the investigator’s attitudes and bias with reference to
the subject of the enquiry.
A sample has to be representative of the population from which it has been selected, if it
is to be useful in arriving at conclusions about the parent population. A representative
sample is one that contains the relevant characteristics of the population in the same
proportion as in the population. Seen from this angle, the non-random sampling methods
described above do not yield representative samples. Such samples are, therefore, not
helpful in drawing valid conclusions about the parent population and the way these
conclusions change when another sample is chosen from the population. Non-random
sampling is, however, useful in certain circumstances. For instance, it is an inexpensive
and quick way to get a preliminary idea of the variable under study or a rough
preliminary estimate of the characteristics of the universe that helps us to design a
scientific enquiry into the problem later. It is thus useful in exploratory research.
------------------------------------------------------------------------------------------------------------
Read
Sections “Non-Probability Sampling” and “Other Sampling Designs”, Chapter 5, Royce
A. Singleton (2005) pp.132 – 138.
Section on “Non-Probability Sampling”, Chapter 4, Kultar Singh (2007), pp. 107 – 108.
------------------------------------------------------------------------------------------------------------
2.6.3 Random or Probability Sampling
Random sampling methods, on the other hand, yield samples that are representative of
the parent universe. The selection process in random sampling is free from the bias of the
individuals involved in drawing the sample as the units of the population are selected at
random for inclusion in the sample. Random sampling is a method of sampling in which
each unit in the population has a predetermined chance (probability) of being included in
the sample. A sampling design is a clear specification of all possible samples of a given
type with their corresponding probabilities. This property of random sampling helps us to
answer the questions we raised at the end of sub-section 2.6.1 above. That is, we can
make estimates of the characteristics of the parent population from the results of a sample
and also indicate the extent of error to which such estimates are subject or the precision
of the estimate. This is better than not knowing anything at all about the magnitude of the
error in our statements regarding the parent population. Let us see how random sampling
helps in this regard.
We noted earlier (the last paragraph of subsection 2.6.1) that the sample mean (an
estimate of the population mean ‘μ’) will have different values in repeated samples drawn
from the population and none of these may be equal to ‘μ’. Suppose that the repeated
samples drawn from the population are random samples. The sample mean computed
from a random sample is a random variable. So is the sampling error, that is, the
difference between ‘μ’ and the sample mean. The values of the sample means (and the
corresponding errors in the estimate of ‘μ’) computed from the repeated random samples
drawn from the population are the values assumed by this random variable with
probabilities associated with drawing the corresponding samples. These will trace out a
frequency distribution that will approach a probability distribution when the number of
random samples drawn increases indefinitely. The probability distribution of sample
means computed from all possible random samples from the population is called the
sampling distribution of the sample mean. The sampling distribution of the sample mean
has a mean and a standard deviation. The sample mean is said to be an unbiased
estimator of the population mean if the mean of the sampling distribution of the sample
mean is equal to the mean of the parent population, say, μ. In general, an estimator “t”
of a population parameter “θ” is an unbiased estimator of “θ” if the mean of the
sampling distribution of “t”, or the expected value of the random variable “t”, is equal
to “θ”. In other words, the mean of the estimates of the parameter made from all possible
samples drawn from the population will be equal to the value of the parameter.
Otherwise, it is said to be a biased estimate. Suppose the mean of the sampling
distribution of the sample mean is Kμ or K+μ, where K is a constant. The bias in the
estimate can be easily corrected in such cases by adopting m/K or (m – K) as the
estimator of the population mean.
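Unbiasedness — the mean of the estimates over all possible samples equalling the parameter — can be checked exactly on a tiny invented population, small enough to list every possible sample. A Python sketch:

```python
from itertools import combinations
from statistics import fmean

# A tiny illustrative population, small enough to enumerate all samples.
population = [2, 4, 6, 8, 10]
mu = fmean(population)                       # population mean = 6.0

# All possible SRSWOR samples of size 2; each is equally probable.
all_samples = list(combinations(population, 2))
sample_means = [fmean(s) for s in all_samples]

# The mean of the sampling distribution equals mu exactly, so the
# sample mean is an unbiased estimator of the population mean.
print(fmean(sample_means), mu)
```

Note that no single sample mean here equals 6.0 exactly unless by chance; it is only their average over all possible samples that does.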
The variance of the sampling distribution of the sample mean is called the sampling
variance of the sample mean. The standard deviation of the sampling distribution of
sample means is called the standard error (SE) of the sample mean. It is also called the
standard error of the estimator (of the population mean), as the sample mean is an
estimator of the population mean. The standard error of the sample mean is a measure of
the variability of the sample mean about the population mean or a measure of the
precision of the sample mean as an estimator of the population mean. The ratio of the
standard deviation of the sampling distribution of sample means and the mean of the
sampling distribution is called the coefficient of variation (CV) of the sample mean or
the relative standard error (RSE) of the sample mean. That is, CV(m) (or RSE(m)) =
SE(m)/E(m), where E(m) is the mean of the sampling distribution.
CV (or RSE) is a free number or is dimension-less, while the mean and the standard
deviation are in the same units as the variable ‘y’. (These definitions can easily be
generalised to the sampling distribution of any sample statistic and its SE and RSE.)
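The standard error and the CV (RSE) of the sample mean can be approximated by simulating the sampling distribution. A Python sketch with an invented population (sizes and values are assumptions for illustration):

```python
import math
import random
from statistics import fmean, pstdev

random.seed(1)
# Invented population; values chosen only for illustration.
population = [random.gauss(80, 12) for _ in range(10000)]
sigma = pstdev(population)

# Simulate the sampling distribution of the sample mean for n = 36.
n = 36
means = [fmean(random.choices(population, k=n)) for _ in range(3000)]

se_simulated = pstdev(means)           # standard error (SE) of the mean
se_theory = sigma / math.sqrt(n)       # sigma / sqrt(n) under SRSWR
rse = se_simulated / fmean(means)      # CV, or relative standard error

print(se_simulated, se_theory, rse)
```

The simulated SE agrees closely with sigma/sqrt(n), and the RSE is a dimension-less number as stated above.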
We have talked about the unbiasedness and precision of the estimate made from the
sample. What more can we say about the precision of the estimate and other
characteristics of the estimate? This is possible if we know the nature of the sampling
distribution of the estimate.
The nature of the sampling distribution of, say, the sample mean, or for that matter any
statistic, depends on the nature of the population from which the random sample is
drawn. If the parent population has a normal distribution with mean μ and variance σ2 or,
in short notation, N (μ, σ2), the sampling distribution of the sample mean, based on a
random sample drawn from this, is N (μ, σ2/n). In other words, the variability of the
sample mean is much smaller than that of the variable of the population and it also
decreases as the sample size increases. Thus, the precision of the sample mean as an
estimate of the population mean increases as the sample size increases.
As we know, the normal distribution N (μ, σ2) has the following properties:
(i) Approximately 68% of all the values in a normally distributed population lie
within a distance of one standard deviation (plus and minus) from the mean,
(ii) Approximately 95% of all the values in a normally distributed population lie
within a distance of 1.96 standard deviation (plus and minus) of the mean,
(iii) Approximately 99% of all the values in a normally distributed population lie
within a distance of 2.576 standard deviation (plus and minus) of the mean.
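The three coverage figures follow from the normal distribution function and can be verified with the standard library's error function, since for a normal variable P(|Y − μ| ≤ zσ) = erf(z/√2). A Python sketch:

```python
import math

# For Y ~ N(mu, sigma^2), P(|Y - mu| <= z * sigma) = erf(z / sqrt(2)).
def coverage(z: float) -> float:
    return math.erf(z / math.sqrt(2))

for z in (1.0, 1.96, 2.576):
    print(z, round(coverage(z), 4))   # approx. 0.6827, 0.95, 0.99
```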
The statement at (iii) above, for instance, is equivalent to saying that the population mean
μ will lie between the observed values (y – 2.576 σ) and (y + 2.576 σ) in 99% of the
random samples drawn from the population N(μ, σ2). Applying this to the sampling
distribution of the sample mean, which is N(μ, σ2/n), we can say that
Pr.[(m – 2.576 σ/√n) < μ < (m + 2.576 σ/√n)] = 0.99,
or that the population mean μ will lie between the limits computed from the sample,
namely, (m – 2.576 σ/√n) and (m + 2.576 σ/√n), in 99% of the samples drawn from the
population. This is an interval estimate, or a confidence interval, for the parameter with
a confidence coefficient of 99% derived from the sample.
The general rule for constructing a confidence interval of the population mean with a
confidence coefficient of 99% is: the lower limit of the confidence interval is given by
the “estimate of the population mean minus 2.576 times the standard error of the
estimate” and the upper limit of the interval by the “estimate plus 2.576 times the
standard error of the estimate”. (2.16)
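Rule 2.16 translates directly into code. In the sketch below the sample values and the known σ are invented for illustration:

```python
import math
from statistics import fmean

# Invented sample; the population SD sigma is assumed known here.
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3, 12.2, 11.9, 12.4, 12.0]
n = len(sample)
sigma = 0.3
m = fmean(sample)                       # estimate of the population mean

z = 2.576                               # 99% confidence coefficient
half_width = z * sigma / math.sqrt(n)   # 2.576 times the SE of the estimate
lower, upper = m - half_width, m + half_width
print(lower, upper)
```

The interval (lower, upper) is the 99% confidence interval of rule 2.16: the estimate minus/plus 2.576 times the standard error of the estimate.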
If the parent population is distributed as N (M, σ2) and σ2 is not known, we make use of
an estimate of σ2. The statistic ‘s2 ‘ given in formula 2.6 can be one such, but this is not
an unbiased estimate of σ2 as E(s2) = [(n – 1)/n] σ2. We, therefore, by using (2.6) & (2.7)
have:
v(y) = [1/(n – 1)] ∑i (yi – m)2 or, v(y) = [1/(n – 1)] [ ∑i yi2 – nm2 ] (2.18)
As the sampling variance of the sample mean ‘m’ is σ2/n, an unbiased estimate v(m) of
the sampling variance will be v(y)/n. Let us now consider the statistic defined by the
ratio t = (m – M) / √[v(y)/n].
The numerator is a random variable distributed as N(0, σ2/n) and the denominator is the
square root of the unbiased estimate of its variance. The sampling distribution of the
statistic ‘t’ is the Student’s t-distribution with (n – 1) degrees of freedom. It is a
symmetric distribution. A confidence interval can now be constructed for the population
mean M from the selected random sample, say with a confidence coefficient of 100(1 – α)%.
The values of ‘tα’ for different values of α = Pr.[t > tα] + Pr.[(– t) < (– tα)] = 2 Pr.[t > tα]
and different degrees of freedom have been tabulated in, for instance, Rao, C.R. and
Others (1966). The confidence interval with a confidence coefficient (1 – α) for the
population mean M would be as in 2.19 below – easily computed from the sample
observations:
[ m – tα √(v(y)/n) , m + tα √(v(y)/n) ] (2.19)
We note that the rule 2.16 applies here also except that we use (i) the square root of the
unbiased estimate of the sampling variance of the estimate of the population mean in
the place of the standard error of the estimate of the population mean, and (ii) the
relevant value of the ‘t’ distribution instead of the normal distribution ……… (2.21)
We have so far dealt with parent populations that are normally distributed. What will be
the nature of the sampling distribution of the sample mean when the parent population is
not normally distributed? We examine this question in the next subsection C.
The Central Limit Theorem ensures that, even if the population distribution is not normal,
♦ the sampling distribution of the sample mean will have a mean equal to the
population mean regardless of the sample size and
♦ as the sample size increases, the sampling distribution of the sample mean
approaches the normal distribution.
Thus for large ‘n’ (sample size), say 30 or more, we can proceed with the steps
mentioned in sub-section A above. Further, the Student’s t-distribution also approaches
the normal distribution as ‘n’ becomes large so that we can use the statistic ‘t’ in sub-
section B as a normally distributed variable with mean 0 and unit variance for samples
of size 30 or more. We may then adopt the procedure outlined in sub-section A.
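The two parts of the Central Limit Theorem can be seen in a simulation with a clearly non-normal parent population. The sketch below is in Python; the exponential population and all sizes are invented for illustration:

```python
import random
from statistics import fmean, pstdev

random.seed(7)
# A clearly non-normal (skewed, exponential) invented population.
population = [random.expovariate(1.0) for _ in range(20000)]
mu = fmean(population)

# Sampling distribution of the sample mean for n = 30.
means = [fmean(random.choices(population, k=30)) for _ in range(4000)]

# (a) The mean of the sampling distribution is close to mu, whatever
#     the sample size; and
# (b) for n >= 30 the distribution of the means is close to normal, so
#     roughly 68% of them lie within one SE of their centre.
se = pstdev(means)
centre = fmean(means)
share_within_1se = sum(abs(x - centre) <= se for x in means) / len(means)
print(centre - mu, share_within_1se)
```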
Read
(1) Sections 2.6 to 2., pp. 38 – 45, Chapter 2 and Sections 3.9a to 3.9d, pp.81 – 84,
Chapter 3, M.N.Murthy (1967) ,
(2) Section 3.10, Chapter 3, pp. 108 –110, Section 5.6, pp. 219 – 230, Chapter 5,
Sections 6.3 and 6.4, pp. 266 - 278, Chapter 6 and Sections 7.1, pp. 300 - 304 and
Sections 7.3 to 7.7, pp. 307 - 325, Chapter 7, Richard Levin & David S. Rubin
(1991) and
(3) Chapters 6 & 7, pp. 113 – 151, P.K. Viswanathan (2007).
Random sampling methods also help in determining the sample size that is required to
attain a desired level of precision. This is possible because the standard error and the
coefficient of variation C.V. of the estimate, say, sample mean ‘m’, are functions of ‘n’,
the sample size. C.V. is usually very stable over the years and its value available from
past data can be used for determining the sample size. We can specify the value of C.V.
of the sample mean that we desire as, say, C(m) and calculate the sample size with the
help of prior knowledge of the population C.V., namely, C. That is, since C(m) = C/√n,
the required sample size is n = [C/C(m)]2.
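The sample-size calculation from the coefficient of variation can be sketched as follows (Python; the planning figures are invented for illustration):

```python
import math

# Invented planning figures: population CV from past data, target CV(m).
C = 0.80          # population coefficient of variation
target_cv = 0.05  # desired CV of the sample mean, C(m)

# Since CV(m) = C / sqrt(n), solving for n gives n = (C / C(m))**2;
# round up to the next whole unit.
n = math.ceil((C / target_cv) ** 2)
print(n)   # → 256
```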
Or we can define the desired precision in terms of the error that we can tolerate in our
estimate of ‘M’ (permissible error), say ‘e’, and link it with the desired value of C(m). Then,
taking C(m) = e/M, the required sample size works out to n = [C·M/e]2.
If the sanctioned budget is F for the survey: Let the cost function be of the form F0 +
F1n, consisting of two components – overhead cost and cost per unit to be surveyed. As
this is fixed as F, F = F0 + F1n, and the sample size becomes n = (F – F0 )/F1. The
coefficient of variation of C(m) is not at our choice in this situation since it gets fixed
once ‘n’ is determined. We can, however, determine the error in the estimate of ‘m’ from
this sample (in terms of the RSE of m), if the population CV, C is known. If further we
suppose that the loss in terms of money is proportional to the value of RSE of m, say, Rs.
‘l’ per 1% of RSE of ‘m’, the total cost of the survey becomes L(n) = F0 + F1n + l(C/√n).
We can then determine the sample size that minimises this new cost (which
includes the cost arising out of loss). Differentiating L(n) w.r.t. n, equating to zero
and simplifying, we get n = [lC/(2F1)]2/3.
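Since dL/dn = F1 − (1/2)·l·C·n^(−3/2), setting the derivative to zero gives n = [lC/(2F1)]^(2/3). This can be checked numerically; the cost and loss figures below are invented for illustration:

```python
# Invented cost and loss figures (illustrative assumptions only).
F0 = 10000.0   # overhead cost
F1 = 50.0      # cost per unit surveyed
l = 4000.0     # loss (Rs.) per 1% of RSE of m
C = 1.2        # population coefficient of variation

def L(n: float) -> float:
    """Total cost including the loss from imprecision."""
    return F0 + F1 * n + l * C / n ** 0.5

# Setting dL/dn = F1 - (1/2) * l * C * n**(-3/2) = 0 and solving:
n_opt = (l * C / (2 * F1)) ** (2 / 3)
print(n_opt)
```

Evaluating L at n_opt and at neighbouring values confirms that n_opt minimises the total cost.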
Read Section 1.6, Chapter 1, pp. 13 – 16; Section 2.13, Chapter 2, p. 48; and Sections 4.2
to 4.9, pp. 96 – 123, Chapter 4, M.N. Murthy (1967).
2.6.4 Methods of Random Sampling
We have so far dealt with random samples drawn from a population. We did not specify
the size of the population. We had assumed that the population is infinite in size. In
practice, a population may have a size N, however large. Let us, therefore, consider
drawing random samples of size ‘n’ from a population of size ‘N’. We shall consider the
following methods of random sampling:
We shall indicate in the following sections a description of the above methods, the
relevant operational procedure for drawing a sample and the expressions/formulae for (a)
the estimator of the population mean/total/proportion, (b) the sampling variance of the
sample mean/total/population and (c) unbiased estimate of the sampling variance.
2.6.4.1 Simple Random Sampling with Replacement (SRSWR)
The method: This method of drawing samples at random ensures that (i) each item in the
population has an equal chance of being included in the sample and (ii) each possible
sample has an equal probability of getting selected. Let us select a sample of ‘n’ units
from a population of ‘N’ units by simple random sampling with replacement (SRSWR).
We select the first unit at random, note its identity particulars for collection of data and
place it back in the population. We choose at random another unit – this could turn out to
be the same unit selected earlier or a different one, note its identity particulars and place
it back. We repeat this process ‘n’ times to get an SRSWR sample of size ‘n’. In such a
sample one or more units may occur more than once. A sample of ‘n’ distinct units is also
possible. It can be shown that the number of possible samples that can be selected by
SRSWR method is N n and that the probability of any one sample being chosen is 1/ N n.
Note: If a unit gets selected in the sample more than once, the corresponding value of yi
will also have to be repeated as many times in the summation for calculating msrswr .
Note that the sampling variance, SE and CV (RSE) of the sample mean in SRSWR are
much less than the SE and CV of the variable y, and these decrease as the sample size
increases. The precision of the sample mean in SRSWR, as an estimator of M, increases
as the sample size increases. However, the extent of decrease in the standard error will
not be commensurate with the size of the increase in the sample size. We would need an
unbiased estimator of σ2, as σ2 may not be known. This is v(y) = [1/(n – 1)] ∑i (yi – m)2 ,
given earlier in (2.18).
Confidence intervals for the population mean/proportion and the sample size for a given
level of precision and/or permissible error can now be derived easily.
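The SRSWR procedure and its estimators can be sketched in code. The population values below are invented for illustration:

```python
import random
from statistics import fmean

random.seed(3)
# An invented finite population of N = 400 units.
population = [random.gauss(60, 15) for _ in range(400)]
n = 40

# SRSWR: every draw is made from the full population, so a unit may
# recur; a repeated unit's value enters the summation as many times.
sample = [random.choice(population) for _ in range(n)]
m_srswr = fmean(sample)

# Unbiased estimate v(y) of sigma^2 (divisor n - 1), and hence an
# unbiased estimate of the sampling variance of m: v(m) = v(y)/n.
v_y = sum((y - m_srswr) ** 2 for y in sample) / (n - 1)
v_m = v_y / n
print(m_srswr, v_m)
```

A confidence interval for the population mean then follows from m_srswr and the square root of v_m, as in rule 2.16.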
Read Sections 3.1 to 3.4, pp. 55 – 66 and Sections 3.7 to 3.8a, pp. 76 – 79, Chapter3,
ibid.
2.6.4.2 Simple Random Sampling without Replacement (SRSWOR)
The Method: This method of sampling is the same as SRSWR but for one difference. If
a unit is selected, it is not placed back before the next one is selected. This means that no
unit gets repeated in a sample. Operationally, we draw random numbers between 1 and N
and if a random number comes up again, it is rejected and another random number is
selected. This process is repeated till ‘n’ distinct units are selected. It can be shown that
the number of samples of size n that may be selected from a population of ‘N’ units by
this method is NCn = N! / [(N – n)! n!] = [N(N – 1)(N – 2)…(N – n + 1)] / [n(n – 1)(n – 2)…1].
The probability Psrswor(S) of any one of the samples being chosen is 1/NCn .
Both msrswor and msrswr are unbiased estimators of M but msrswor is a more efficient
estimator of M than msrswr . The factor [(N – n)/(N – 1)] in (2.40) is called the finite
population correction or finite population multiplier. The finite population correction
need not, however, be used when the sampling fraction (n/N) is less than 0.05.
As in SRSWR, the sample proportion ‘p’ is an unbiased estimate of ‘P’ in SRSWOR also. (2.46)
Read Sections 3.5 to 3.7, pp. 67 – 78 and Sections 3.8 b & c, pp.80 – 81, Chapter 3, ibid.
2.6.4.3 Interpenetrating Sub-Samples (I-PSS)
Suppose a sample is selected in the form of two or more sub-samples drawn according to
the same sampling method so that each such sub-sample provides a valid estimate of the
population parameter. The sub-samples drawn in this way are called interpenetrating
sub-samples (I-PSS). This is operationally convenient, as the different sub-samples could
be allotted to different investigators. The sub-samples need not be independently
selected. There is, however, an important advantage in selecting independent
interpenetrating sub-samples. It is then possible to easily arrive at an unbiased estimate
of the variance of the estimator even in cases where the sampling method/design is
complex and the formula for the variance of the estimator is complicated.
If the unbiased estimator ‘t’ of the parameter θ is symmetrically distributed (for example,
normally distributed), the probability of the parameter θ lying between the maximum and
the minimum of the ‘h’ estimates of θ obtained from the ‘h’ sub-samples is given by:
Prob.[Min of {t1, t2 ,---- t h } < θ < Max of {t1, t2,-----t h}] = [1 – (1/2)^(h – 1) ] (2.52)
This is a confidence interval for θ from the sample. The probability increases rapidly with
the number of I-P sub-samples – from 0.5 (two sub-samples) to 0.875 (four sub-samples).
The IPSS technique is also useful in assessing non-sampling errors. (see Box below.)
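The confidence coefficient 1 – (1/2)^(h – 1) in (2.52) can be checked directly. The sketch below computes it for the values quoted in the text and verifies it by simulation with normally distributed sub-sample estimates; the parameter values are hypothetical.

```python
import random

# Sketch: with h independent interpenetrating sub-samples and a symmetrically
# distributed estimator, Prob[min(t_i) < theta < max(t_i)] = 1 - (1/2)^(h-1).
def ipss_confidence(h):
    """Confidence coefficient of the min-max interval from h sub-samples."""
    return 1 - 0.5 ** (h - 1)

print(ipss_confidence(2), ipss_confidence(4))   # 0.5 and 0.875, as in the text

# Monte Carlo check with hypothetical normal sub-sample estimates
random.seed(1)
theta, h, hits, trials = 100.0, 4, 0, 20000
for _ in range(trials):
    t = [random.gauss(theta, 5) for _ in range(h)]
    if min(t) < theta < max(t):
        hits += 1
print(hits / trials)                            # close to 0.875
```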
The Method: Let {Ui}, i = 1,2,………. N be the units in a population. Let ‘n’ be the size
of the sample to be selected. Let ‘k’ be the integer nearest to N/n - denoted usually as
[N/n] - the reciprocal of the sampling fraction. Let us choose a random number from 1 to
k, say, ‘r’. We then choose the rth unit, that is, Ur . Thereafter, we select every kth unit. In
other words, we select the units Ur, Ur+k , Ur+2k ,………… This method of sampling is
called systematic sampling with a random start. ‘r’ is known as the random start and ‘k’
the sampling interval. There would thus be ‘k’ possible systematic samples, each
corresponding to one random start from 1 to k. The sample corresponding to the random
start ‘r’ will be {Ur , Ur + k , Ur + 2k , ……, Ur + (n – 1)k}.
The sample size of all the ‘k’ systematic samples will be ‘n’ if N = nk. All the ‘k’
systematic samples will not have a sample size ‘n’ if N ≠ nk. For example, if we have a
population of 100 units and we wish to select systematic samples of size 15, the sampling
interval is k = [100/15] = 7. The samples with the random starts 1 and 2 will be of size
15 while the other 5 systematic samples (with random starts 3 to 7) will be of size 14.
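The selection procedure just described is easy to code. The sketch below (illustrative only; units are simply labelled 1..N) selects a linear systematic sample with a random start and reproduces the sample sizes in the text's N = 100, k = 7 example.

```python
import random

# Minimal sketch of linear systematic sampling with a random start.
def systematic_sample(N, k, r=None):
    """Select units r, r+k, r+2k, ... not exceeding N (units labelled 1..N)."""
    if r is None:
        r = random.randint(1, k)            # random start in 1..k
    return list(range(r, N + 1, k))

# The text's example: N = 100, k = 7
print(len(systematic_sample(100, 7, r=1)))  # 15 units (starts 1 and 2)
print(len(systematic_sample(100, 7, r=3)))  # 14 units (starts 3 to 7)
```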
msys* = (k/N) ∑i yi , ∑i , i = 1 to n*, where n* is the size of the selected sample and k the
sampling interval (2.53)
If N = nk, msys* = m the sample mean. If N ≠ nk, there is a bias in using the sample mean
as the estimator for M, and
the bias in using the sample mean as an estimator of M is likely to be small in the case of
systematic samples selected from a large population. (2.54)
Taking the earlier example of selecting a systematic sample of size 15 from a population
of 100 units (N = 100, k = 7 and n = 15), all the samples can be made to have a size of
15 by adopting circular systematic sampling (CSS). A random start of 5 will lead to the
selection of a sample of the 15 units 5, 12, 19, 26, 33, 40, 47, 54, 61, 68, 75, 82, 89, 96
and 3 (96 + 7 – 100). This procedure ensures equal probability of selection to every unit
in the population.
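The circular device can be sketched as follows; the code treats the unit labels 1..N as arranged on a circle so that every start yields exactly n units, and reproduces the text's worked example.

```python
import random

# Sketch of circular systematic sampling (CSS): wrap around the list of
# N unit labels so every sample has exactly n units.
def circular_systematic_sample(N, k, n, r=None):
    if r is None:
        r = random.randint(1, N)            # random start anywhere in 1..N
    return [((r - 1 + j * k) % N) + 1 for j in range(n)]

# The text's example: N = 100, k = 7, n = 15, random start 5
sample = circular_systematic_sample(100, 7, 15, r=5)
print(sample)   # 5, 12, ..., 89, 96 and then wraps around to 3
```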
Besides constancy of the sample size from sample to sample, the CSS procedure
ensures that the sample mean mr is an unbiased estimate of the population mean.
(2.55)
Let nk = N. Then m* = m. There are k possible samples, each sample with a probability
of 1/k. Let the sample mean of the r-th systematic sample be mr = (1/n)∑i yir, where yir is
the value of the characteristic under study for the i-th unit in the r-th systematic sample,
summation is from i = 1 to n. As already noted, mr is an unbiased estimator of M, or E(mr)
= M. We thus have k possible unbiased estimates of M. Denoting the sample mean in
systematic sampling as msys, the sampling variance of msys, and related results of interest
are:
Equation 2.57 shows that (i) V(msys) is less than the variance of the variable under study
or the population variance, since σw2 > 0 and (ii) V(msys) can be reduced by increasing
σw2, that is, by increasing the within-sample variance. (ii) would happen if the units within
each sample are as heterogeneous as possible.
Aspects of systematic sampling listed below are important. Find out about these from the
suggested readings (Box below).
Read
Sections 5.1 and 5.2, pp. 133 – 141, Sections 5.4 to 5.10, pp. 142 – 171, Chapter 5, ibid.
The Sampling Method: We have so far considered sampling methods in which the
probability of each unit in the population getting selected in the sample was equal. There
are also methods of sampling in which the probability of any unit in the population
getting included in the sample varies from unit to unit. One such method is sampling with
probability proportional to size (pps) in which the probability of selection of a unit is
proportional to a given measure of its size. This measure may be a characteristic related
to the variable under study. One example may be the employment size of a factory in the
past year and the variable under study may be the current year’s output. Does this method
lead to a bias in our results, as units with smaller sizes would be under-represented in the
sample and those with larger sizes would be over-represented? It is true that if the sample
mean ‘m’ were to be used to estimate the population mean M, m would be a biased
estimator of M. However, what is done in this method of sampling is to weight the
sample observations with suitable weights at the estimation stage to obtain unbiased
estimates of population parameters, the weights being the probabilities of selection of
the units.
Estimates from pps sample of size 1: Let the population units be {U1,U2, -------UN}.Let
the main variable Y and the related size variable X associated with these units be{Y1, X1;
Y2, X2; …………YN, XN}. The probability of selecting any unit, say, Ui in the sample
will be Pi = (Xi / X), where X = ∑i Xi , ∑i , i = 1 to N. Let us select one unit by the pps
method. Let the unit selected thus have the values y1 and x1 for the variables y and x. The
variables y and x are random variables assuming values Yi and X i respectively with
probabilities Pi , i = 1,2,…………N. The following results based on the sample of size 1
can be derived easily:
These show that the variance of the estimate will be small if the Pi are proportional to Yi.
Estimators from pps sample of size > 1 [pps with replacement (pps-wr)]
A sample of n ( > 1) units with pps can be drawn with or without replacement. Let us
consider a pps-wr sample. Let {yi , pi} be respectively the sample observation on the
selected unit and the initial probability of selection at the i-th draw, i = 1,2,----n. Each (yi
/ pi) , i = 1,2,----- n in the sample is an unbiased estimate [Yi(pps-wr)*] of the population
total Y and V(Yi(pps-wr)*) = ∑r (Yr2 / Pr) – Y2, ∑r , r = 1 to N (see 2.60). Estimates from
pps-wr samples are:
(1) Cumulate the sizes of the units to arrive at the cumulative totals of the unit sizes.
Thus Ti - 1 = X1 + X2 + -----+ Xi – 1 ; Ti = X1 + X2 +---+ Xi - 1 + Xi = Ti - 1 + Xi ; i =
1,2,……..., N.
(2) Choose a random number R between 1 and TN ( = X ).
(3) Choose the unit Ui if R lies between Ti - 1 and Ti , that is, if Ti - 1 < R ≤ Ti . The
probability P(Ui) of selecting the i-th unit will thus be P(Ui) = (Ti - Ti – 1) /TN = Xi
/ X = Pi
(4) Repeat the operation ‘n’ times for selecting a sample of size n with pps-wr.
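The cumulative-total method above can be sketched directly; the unit sizes below are hypothetical, and the empirical selection frequencies should approach the probabilities Pi = Xi / X.

```python
import random

# Sketch of pps-wr selection by the cumulative total method described above.
def pps_wr_sample(sizes, n, rng=random):
    # (1) cumulate the sizes: T_i = X_1 + ... + X_i
    T, total = [], 0
    for x in sizes:
        total += x
        T.append(total)
    selected = []
    for _ in range(n):
        R = rng.randint(1, total)                    # (2) R between 1 and T_N
        i = next(j for j, t in enumerate(T) if R <= t)  # (3) T_{i-1} < R <= T_i
        selected.append(i)                           # with replacement: repeats allowed
    return selected

random.seed(7)
sizes = [10, 30, 60]                  # hypothetical: P_i = 0.1, 0.3, 0.6
draws = pps_wr_sample(sizes, 10000)
freq = [draws.count(i) / len(draws) for i in range(3)]
print(freq)                           # roughly [0.1, 0.3, 0.6]
```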
There are other sampling methods under pps like pps without replacement and pps
systematic sampling. (Readings – Box below)
------------------------------------------------------------------------------------------------------------
Read Sections 6.1 to 6.4, pp. 183 – 197 and Section 6.10, pp. 200 – 202; Chapter 6,
Section 6.10a to 6.10c, pp. 201 – 208; Section 6.11 a to c, pp. 209 – 215, ibid.
The Method: We might sometimes find it useful to classify the universe into a number
of groups and treat each of these groups as a separate universe for purposes of sampling.
Each of these groups is called a stratum and the process of grouping is called stratification.
Estimates obtained from each stratum can then be combined to arrive at estimates for the
entire universe. This method is very useful as (i) it gives estimates not only for the whole
universe but also for the sub-universes and (ii) it affords the choice of different sampling
methods for different strata as appropriate. It is particularly useful when a survey
organisation has regional field offices. This method is called Stratified Sampling.
Let us divide the population (universe) of N units into k strata. Let Ns be the number of
units in the s-th stratum. Ysi be the value of the i-th unit in the s-th stratum. Let the
population mean of the s-th stratum be Ms , where Ms = (1/Ns)∑i Ysi , ∑i , i = 1,2,----,Ns (that
is, over the units within the s-th stratum), and the population mean M = (1/N)∑s Ns Ms =
∑s Ws Ms , where Ws = (Ns / N), ∑s being over the strata s = 1,2,……,k. Suppose that we
select random samples from each stratum, where the sampling methods for different strata
may be different. Let the unbiased estimate of the population mean Ms of the s-th stratum be ms.
Denoting ‘st’ for stratified sampling, an unbiased estimator of M is given by
Cov.(ms , mr) = 0 for s ≠ r ; (samples from diff. strata are independently chosen) .. (2.68)
Thus estimators with smaller variance (efficient estimators) can be obtained in stratified
sampling if we form the strata in such a way as to minimise intra-strata or within-strata
variation, that is, variance within strata. This would mean maximising between-strata or
inter-strata variation, since the total variation is made up of within-strata and between-
strata variation. In other words, units in a stratum should be homogeneous.
The stratum sample size should, therefore, be proportional to Ws√(Vs / Fs). The minimum
variance with the ns so determined is,
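The allocation rule above can be sketched in Python. The stratum counts, variances and costs below are hypothetical, and I am taking the text's Fs to denote the per-unit cost of sampling in stratum s.

```python
from math import sqrt

# Sketch of optimum allocation: n_s proportional to W_s * sqrt(V_s / F_s),
# where W_s = N_s / N, V_s is the stratum variance and F_s is taken here to
# be the per-unit cost in stratum s. All figures are hypothetical.
def optimum_allocation(n, Ns, Vs, Fs):
    N = sum(Ns)
    weights = [(Ni / N) * sqrt(Vi / Fi) for Ni, Vi, Fi in zip(Ns, Vs, Fs)]
    total = sum(weights)
    return [n * w / total for w in weights]

# 3 strata: sizes 500/300/200, variances 25/100/400, per-unit costs 1/1/4
alloc = optimum_allocation(100, Ns=[500, 300, 200], Vs=[25, 100, 400], Fs=[1, 1, 4])
print([round(a, 1) for a in alloc])   # more variable, cheaper strata get more units
```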
Read Section 7.1, pp. 232 – 233, Section 7.2, pp. 235 – 236 and Section 7.4, pp. 239 –
243 (especially Section 7.4b, pp. 241 – 243), Chapter 7, ibid.
------------------------------------------------------------------------------------------------------------
Estimates from cluster sampling: Let us consider a population of NK units divided into
N mutually exclusive clusters of K units each – a case of clusters of equal size. The
population mean M and the cluster means are given respectively by M = (1/N)∑sms, ∑s
being over clusters s = 1 to N and ms = (1/K)∑i Ysi, ∑i being from i = 1 to K within the s-
th cluster. Let us draw a sample of one cluster by srs. The cluster mean mc-srs (the
subscript c-srs denotes cluster sampling with srs) is an unbiased estimate of M. The
sampling variance of the sample cluster mean is
Let us compare V(mc-srs) with the sampling variance of the sample mean when K units are
drawn from NK units by the SRSWR method. How does the “sampling efficiency” of cluster
sampling compare with that of SRSWR? The sampling efficiency of cluster sampling
compared to that of SRSWR, Ec/srswr , is defined as the ratio of the reciprocals of the
sampling variances of the unbiased estimators of the population mean obtained from
the two sampling methods. The sampling variances and sampling efficiency are
Thus, cluster sampling is more efficient than SRSWR if the within-cluster variance is
larger than (K –1) times the between-cluster variance. Is this likely? This is not likely as
the between-cluster variance will usually be larger than the within-cluster variance due to
within-cluster homogeneity. Cluster sampling is in general less efficient than sampling
of individual units from the point of view of sampling variance. Sampling of individuals
could provide a better cross section of the population than a sample of clusters since units
in a cluster tend to be similar.
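The comparison can be illustrated by simulation. The sketch below builds a hypothetical population of internally homogeneous clusters and compares the between-cluster variance (the sampling variance of the mean of one randomly chosen cluster) with σ²/K for SRSWR of K units; all numbers are made up.

```python
import random
import statistics

# Sketch: compare V of the mean of one random cluster (c-srs) with V of the
# mean of K units drawn by SRSWR, for N clusters of K units each.
random.seed(3)
N, K = 30, 10
# Clusters made internally homogeneous on purpose (units in a cluster are similar)
clusters = [[random.gauss(mu, 1) for _ in range(K)]
            for mu in [random.gauss(50, 8) for _ in range(N)]]
all_units = [y for c in clusters for y in c]

sigma2 = statistics.pvariance(all_units)
cluster_means = [statistics.mean(c) for c in clusters]
v_cluster = statistics.pvariance(cluster_means)   # V(m_c-srs): between-cluster variance
v_srswr = sigma2 / K                              # V(m) for K units by SRSWR
print(f"V(cluster) = {v_cluster:.2f}, V(srswr) = {v_srswr:.2f}")
print("Efficiency of cluster sampling relative to SRSWR:",
      round(v_srswr / v_cluster, 3))              # < 1: cluster sampling less efficient
```

With homogeneous clusters, the between-cluster variance dominates and the efficiency ratio falls well below 1, as the text argues.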
Read Sections 8.1, 8.2 and 8.2a, Chapter 8, ibid. pp. 293 – 297,
------------------------------------------------------------------------------------------------------------
We noted in the sub-section on cluster sampling that random sampling of units directly is
more efficient than random sampling of clusters of units. But cluster sampling is
operationally convenient. How to get over this dilemma? We may first select a random
sample of clusters of units and thereafter select a random sample of individual units from
the selected clusters. We are thus selecting a sample of units, but from selected clusters of
units. What we are attempting is a two-stage sampling. This can thus be a compromise
between the efficiency of direct sampling of units and the relatively less efficient
sampling of clusters of units. This type of sampling would be more efficient than cluster
sampling but less efficient than direct sampling of individual units. In the sampling
procedure now proposed, the clusters of units are the first stage units (fsu) or the
primary stage units (psu). The individual units constitute the second stage units (ssu) or
the ultimate stage units (usu).
This procedure of sampling can also be generalised to multi-stage sampling. Take for
instance a rural household survey. The fsu’s in such a survey may consist of districts, the
ssu’s may be the tehsils or taluks chosen from the districts selected in the first stage, the
third stage units could be the villages selected from the tehsils or taluks selected in the
second stage and the fourth and the ultimate stage units (usu’s) may be the households
selected from the villages selected in the third stage. Such multi-stage sampling
procedures help in utilising such information related to the variable under study as may
be available in choosing the sampling method appropriate at different stages of sampling.
In a multi-stage sampling, estimates of parameters are built up stage by stage. For
instance, in two-stage sampling, estimates of the sample aggregates relating to the fsu’s
are built up from the ssu’s using the sampling method adopted for selecting the ssu’s.
These estimates are then used with the sample probabilities of selection of fsu’s to build
up estimates of the relevant population parameters.
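The stage-by-stage build-up can be sketched for the simplest case, SRSWR at both stages: estimate each selected fsu's total from its sampled ssu's, then average the fsu-total estimates and scale up. The population below is hypothetical.

```python
import random
import statistics

# Hedged sketch of two-stage sampling with SRSWR at both stages:
# the population total is estimated stage by stage. Data are made up.
random.seed(11)
fsus = [[random.gauss(20 + 2 * i, 3) for _ in range(50)] for i in range(8)]
N, n, k = len(fsus), 3, 10          # 8 fsu's; select 3, then 10 ssu's in each
true_Y = sum(sum(f) for f in fsus)  # population total

def two_stage_estimate():
    chosen = random.choices(range(N), k=n)               # stage 1: SRSWR of fsu's
    est = []
    for i in chosen:
        sub = random.choices(fsus[i], k=k)               # stage 2: SRSWR of ssu's
        est.append(len(fsus[i]) * statistics.mean(sub))  # estimate of fsu total
    return N * statistics.mean(est)                      # scale up to the population

# Unbiasedness: the average over many repetitions approaches the true total
estimates = [two_stage_estimate() for _ in range(5000)]
print(round(statistics.mean(estimates)), round(true_Y))
```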
Read
Sections 9.1 and 9.2, Chapter 9, ibid. , pp. 317 – 322.
Chapter 3, NSSO (1997), pp. 12 – 15.
1. When we do not have any a priori information about the nature of the population
variable under study, SRSWR and SRSWOR would be appropriate. Both are
operationally simple. However, SRSWOR is to be preferred, since V(msrswor) <
V(msrswr). This advantage is appreciable only when the sampling fraction (n/N) is
not small.
2. Systematic sampling is operationally even simpler than SRSWR and SRSWOR, but it
should not be used for sampling from populations where periodic or cyclic
trends/variations exist, though this difficulty can be overcome if the period of the
cycle is known. V(msys) can be reduced if the units chosen in the sample are as
heterogeneous as possible. But this will call for a rearrangement of the population
units before sampling.
3. When additional information is available about the variable ‘y’ under study, say, on a
variable (size variable) ‘x’ related to ‘y’, the pps method should be preferred. The
sampling variance of Y* (or m) gets reduced when the probabilities of selection of
units Pi = (Xi / X) are proportional to Yi , that is, the size Xi is proportional to Yi or
the variables x and y are linearly related to each other and the regression line passes
through the origin. In such cases pps is more efficient than SRSWR. Further, this
method can be combined with other sampling methods to enhance their relative
efficiencies. pps is operationally simple. pps-wor combines the efficiency of
SRSWOR and the efficiency-enhancing capacity of pps. However, most of the
SRSWOR and the efficiency-enhancing capacity of pps. However, most of the
procedures of selection available, estimators and their variance for pps-wor are
complicated and are not commonly used in practice. This is particularly so in large-scale
sample surveys with a small sampling fraction, as in such cases sampling
without replacement does not result in much gain in efficiency. Hence unless the
sample size is small, we should prefer pps-wr.
4. Stratified sampling comes in handy when we wish to get estimates at the level of sub-
populations or regions or groups. This method also gives us the freedom to choose
different sampling methods/designs in different strata as appropriate to the group
(stratum) of the population and the opportunity to utilise available additional
information relating to the stratum. The sampling variance of estimators can also be
brought down by forming the strata in such a way as to ensure homogeneity of units
within individual strata. In fact, the stratum sizes can be so chosen as to minimise the
variance of estimators, when there is a ceiling on the cost of the survey. Stratified
sampling with SRS, SRSWOR or pps-wr presents a set of efficient sampling
designs.
6. Finally, we can use the technique of independent I-PSS in conjunction with the
chosen sampling design to get at (i) an unbiased estimate of V(m) for any sampling
design or estimator of V(m), however complicated, (ii) a confidence interval for ‘M’
based only on the I-PSS estimates (when the population distribution is symmetrical)
and (iii) a tool for monitoring the quality of work of the field staff and agencies.
Read also
Sections 14.8 and 14.9, Chapter 14, M.N.Murthy (1967) pp.493 – 497.
------------------------------------------------------------------------------------------------------------
There are broadly three methods of collecting data. The array of tools used for data
collection by such methods has expanded over time with the advent of modern
technology. Confining data collection efforts to a sample from the population of interest
to the study, inevitably leads to questions like the use of random and non-random
samples. Judgment sampling, convenience sampling, purposive sampling, quota sampling
and snowball sampling all belong to the latter group. The absence of a clear relationship
between a non-random sample and the parent universe and the presence of the
researcher’s bias in the selection of the sample render such samples useless for drawing
valid conclusions about the parent population. But these methods are inexpensive and
quick ways of getting a preliminary idea of the universe for use in designing a detailed
enquiry and in exploratory research. Random samples, on the other hand, are free from
such drawbacks and have properties that help in arriving at valid conclusions about the
parent population.
The simplest of the sampling methods – SRSWR - ensures equal chance of selection to
every unit of the population and yields a sample in which one or more units may occur
more than once. ‘msrswr’ is an unbiased estimator of M. Its precision as an estimator of M
increases as the sample size increases. SRSWOR yields a sample of distinct units.
‘msrswor’ is also unbiased for ‘M’. SRSWOR is more efficient than SRSWR as
V(msrswor) < V(msrswr). But this advantage disappears when the sampling fraction is small
(< 0.05). Both provide an unbiased estimator of V(m). An operationally convenient
procedure - interpenetrating sub-samples (I-PSS) – also provides an unbiased estimator
of V(m) for any sampling design and estimator for V(m), however complicated.
An example of methods where the probability of selection varies from unit to unit is pps.
The “size” could be the value of a variable related to the study variable. In pps, each yi /
pi , where yi is the value of the study variable associated with the selected unit and pi the
probability of selection of the unit, is an unbiased estimate (Y*) of the population total Y
and [(1/N) Y*] an unbiased estimator of M. As V(Y*) is small if the probabilities Pi are
roughly proportional to Yi , pps sampling is more efficient than SRS if the size variable
x is proportional to y, that is, x and y are linearly related and the regression line passes
through the origin. pps sampling can be done with SRSWR, SRSWOR or systematic
sampling. In pps-srswr, [(1/n) ∑i(yi / pi)], (∑i i = 1 to n), is an unbiased estimator of Y.
This being the mean of n independent unbiased estimates with the same variance V(Y*),
v(Y*) can be derived using the I-PSS technique.
Stratified Sampling is used when (i) estimates are needed for subgroups of a universe or
(ii) the subgroups could be treated as sub-universes. It gives us the freedom to choose the
sampling method as appropriate to each stratum. Estimates of parameters are available
for the sub-universes (strata) and these can then be combined over the strata to get
estimates for the entire universe. SE of estimates based on stratified sampling can be
small if we form the strata in such a way as to minimise intra-strata variance. Each
stratum should thus consist of homogeneous units, as far as possible. Stratum-wise
sample sizes can also be so chosen as to minimise the variance of estimators.
Thus while non-random sampling methods are useful in exploratory research and
preliminary work on planning of enquiries, random sampling techniques lead to valid
judgments regarding the universe. Among random sampling methods, SRSWOR,
stratified sampling with SRSWOR and, when available information permits, pps-wr
and stratified sampling with pps-wr, turn out to be a set of the more efficient and
practically useful designs to choose from. I-PSS can also be used in these designs
where possible and necessary.
Current developments in sample survey theory and methods touch upon all the sub-areas
of the subject, namely, data collection and processing, survey design and estimation or
inference. Among the developments receiving increasing attention from researchers and
survey practitioners are: the use of telephones for surveys where tele-density is high, both
for sample selection and for data collection (Random Digit Dialing); tackling non-response
through techniques like split questionnaires; the ordering of questions on sensitive
information in the questionnaire; the application of artificial neural networks to data
editing and imputation; total survey design approaches that tackle total survey error; the
Dual Frame Methodology to enable the small area estimation that is so necessary for
regional planning; sampling on more than one occasion and related issues; replication
methods to tackle measurement error; post-stratification (stratification after collection of
data); the use of auxiliary information at the time of estimation; resampling methods like
the jackknife and the bootstrap; and methods of estimating complex functions of
parameters (distribution functions, quantiles, poverty proportions and ordinates of the
Lorenz Curve, especially in the presence of measurement error) and their variances,
together with the related computer packages. The following references may be useful for
an appreciation of these developments.
Rao J.N.K. (1999): Some Current Trends in Sample Survey Theory and Methods,
(Special Issue on Sample Surveys), Vol. 61, Series B, Part 1, pp. 1 – 57.
Fuller W. A. & Jay Breidt F.(1999):Estimation for Supplemented Panels, ibid, pp.58 –
70.
Shao, Jun & Chen, Yinzhong (1999): Approximate Balanced Half Sample and Related
Replication Methods for Imputed Survey Data, ibid, pp. 197 – 201.
Pfeffermann, Danny & Tiller, Richard (2006): Small – Area Estimation with State –
Space Models Subject to Benchmark Constraints, Vol.101, No. 476, pp. 1387 – 1397.
Mach, Lenka; Reiss, Philip T. & Schiopu-Kratina, Ioana (2006): Optimizing the
Expected Overlap of Survey Samples via the North West corner Rule, Vol. 101, No. 476,
pp.1671 – 1679.
Qin, Jing; Shao, Jun & Zhang, Biao (2008): Efficient and Doubly Robust Imputation for
Covariate-Dependent Missing Responses, Vol. 103, No. 482, pp. 797 – 810.
Kim J.K. & Park H (2006): Imputation Using Response Probability, Vol. 34, No. 1, pp.
171 – 182.
Kim J.F. & Kim J.J. (2007): Non Response Weighting Adjustment Using Estimated
Response Probability, Vol. 35, No.4, pp. 501 – 514.
1. You have been asked to conduct a study to determine the literacy rate in a district.
The choice before you is to adopt a census approach or a random sample survey. How
would you make a choice between the two? What considerations would lead you to a
choice?
2. What tools of data collection would you make use of in the above enquiry and why?
3. What is meant by a ‘parameter’? Define the term ‘statistic’. Give the expressions for
the population mean, sample mean, population variance and sample standard
deviation.
4. What is the most important purpose of studying the population on the basis of a
sample? In this context, define the terms ‘estimator’ and ‘estimate’ with a suitable
example.
7. How does the random sampling procedure help in correcting for the bias of an
estimate? Illustrate this with the help of an example.
8. The sample proportion ‘p’ calculated from a random sample of size ‘n’ may be
considered as normally distributed with mean P and standard deviation √(PQ/n), when
n is sufficiently large. Construct a confidence interval for ‘P’ with a confidence
coefficient 0.99, when n is large.
10. A population has 80 units. The relevant variable has a population mean of 8.2 and a
variance of 4.41. Three SRSWR samples of size (i) 16, (ii) 25 and (iii) 49 are drawn
from the population. What is the standard error (SE) of the sample means in the three
samples? Is the extent of reduction in SE commensurate with that of the increase in
sample size?
11. What are the results when the sampling method in drawing the three samples in
problem 10 above is changed to SRSWOR? What is your advice regarding the choice
between increasing the sample size and changing the sampling method from srswr to
srswor?
12. Why is it said that SRSWOR is not a more efficient sampling design from the point of
view of the precision of the sample mean as an estimator of the population mean for
sampling fractions of less than 0.05?
13. Indicate whether the following statements are true (T) or false (F). If false, what is the
correct position?
(i) The standard error of the sample mean decreases in direct proportion to the
sample size.
(ii) SRSWOR method of sampling is more advantageous than srswr for a
sampling fraction of 0.02.
(iii) If Y* = Nm and Variance of m is V(m), the variance of Y* is NV(m).
14. What should be the size of the SRSWR sample to be selected if the coefficient of
variation of the sample mean should be 0.2? The population coefficient of variation is
known to be 0.8. What will be the sample size if we decide to adopt SRSWOR
sampling method?
15. Why would you recommend the use of the technique of interpenetrating sub-samples
in random sampling?
16. Four estimates of the population mean M obtained from four independent I-PSS
samples of equal size are 20, 18, 23 and 28. Obtain an unbiased estimate of the
sampling variance of the sample mean. Assuming that the population is normally
distributed, compute a confidence interval for M. What is the confidence coefficient
for this confidence interval? Do these results depend on the sample size?
17. A systematic sample of size 18 has to be selected from a population of 124. What
problems do you face in selecting the sample? Is the sample mean the unbiased
estimator of the population mean M? How do you overcome these problems?
18. Is the sample mean ‘m’ always an unbiased estimator of the population mean ‘M’ in
systematic sampling? If not when? What then is an unbiased estimator of M in cases
where ‘m’ is not an unbiased estimator of ‘M’? What is the bias in using the sample
mean ‘m’ as the estimator of ‘M’ in these situations? Show that this bias is likely to
be small for systematic samples from large populations.
19. What is the sampling variance of msys? What steps can be taken to reduce V(msys)?
20. When will systematic sampling be more efficient than (i) SRSWR ; (ii) SRSWOR ?
22. “It is not possible to get v(msys).” What are the reasons for this situation in the case of
systematic sampling? How is this problem overcome and how then can we get
v(msys)?
23. What are the situations in which systematic sampling should not be adopted? What
information is needed in such situations to use systematic sampling and how would
you use such information?
(a) pps and stratified sampling can be combined with other sampling methods.
(b) V(mst) is reduced by ensuring that units within individual strata are heterogeneous.
(c) The size of a stratified sample can be allocated among the strata in proportion to the
size of the strata, the size being the number of population units in a stratum.
27. We wish to study the wage levels of factory labour. What type of sampling method
would you adopt for the study and why if (a) just a list of factories is available with
the Chief Inspector of Factories of different State Governments, (b) if the list in (a)
above also gives the total number of employees in the individual factories at the end
of last year and (c) the list also indicates the kind of product manufactured in the
factory in addition to the information specified in (b) above.
28. Show that cluster sampling is less efficient than direct sampling of individuals. Why
does this happen?
30. Is it correct to say that stratified sampling is a kind of multistage sampling? Why?
Structure
3.0 Objectives
3.1 Introduction
3.2 An overview of the Block
3.3 Important Steps involved in an Econometric Study
3.4 Two Variable Regression Model
3.4.1 Estimation of Parameters
3.4.2 Goodness of Fit
3.4.3 Functional Forms of Regression Model
3.4.4 Hypothesis Testing
3.5 Multi-Variable Regression Model
3.5.1 Regression Model with two explanatory variables
3.5.1.1 Estimation of Parameter: Ordinary Least Squares Approach
3.5.1.2 Variance and Standard Errors
3.5.1.3 Interpretation of Regression Coefficients
3.5.2 Goodness of Fit: Multiple Coefficient of Determination (R2)
3.5.3 Analysis of Variance
3.5.4 Inclusion and Exclusion of Explanatory Variables
3.5.5 Generalisation to N-Explanatory Variables
3.5.6 Problem of Multicollinearity
3.5.7 Problem of Heteroscedasticity
3.5.8 Problem of Autocorrelation
3.6 Further Suggested Readings
3.7 Model Questions
3.0 OBJECTIVES
After going through this unit, you should be able to:
• explain the issue of linearity in the regression model and appreciate its probabilistic
nature,
• estimate the unknown population regression parameters with the help of sample
information, and
• explain the concept of goodness of fit and use the various functional forms in
estimating a regression model,
3.1 INTRODUCTION
As a researcher, you may be tempted to examine whether economic laws hold good in
real-world situations, as reflected in the patterns displayed by the relevant data. Similarly,
as a professional economist in government or the private sector, you may be interested in
estimating the demand or supply of various products and services, or in knowing the effect
of various levels of advertisement expenditure on sales and profits. As a macroeconomist,
you may like to measure and evaluate the impact of various government policies, say
monetary and fiscal policies, on important variables such as employment, unemployment,
income, imports and exports, interest rates and inflation rates. As a stock market analyst,
you may seek to relate the price of a stock to the characteristics of the company issuing
the stock and the overall state of the economy. Such issues are investigated by employing
various statistical techniques. Regression modeling is one of the primary statistical tools
employed for conducting such research studies.
The basic steps followed in conducting a regression-model-based empirical study are: (1)
specification of a model, based on the knowledge of economic theory, past experience or
other studies, (2) gathering the data, (3) estimating the model, (4) subjecting the model to
hypothesis testing, and (5) interpreting the results.
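As a small foretaste of step (3), the sketch below estimates a two-variable regression Y = α + βX by ordinary least squares using the textbook formulas; the data points are hypothetical.

```python
# Minimal sketch of OLS for a two-variable regression Y = alpha + beta * X:
# beta = S_xy / S_xx, alpha = mean(y) - beta * mean(x). Data are made up.
def ols_two_variable(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    beta = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
            / sum((xi - mx) ** 2 for xi in x))
    alpha = my - beta * mx
    return alpha, beta

x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]        # roughly y = 2x
alpha, beta = ols_two_variable(x, y)
print(round(alpha, 3), round(beta, 3))
```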
Specifying a Model:
In economics, as in the physical sciences, the model (the logical structure of a system) is
set up in the form of equations, which precisely describe the behavior of economic and
related variables. The model may consist of a single equation or several equations. In
the specification of single equation, the behavior of a single variable (denoted by Y) is
explained. Placed on the left hand side, Y is referred to by a number of names like
dependent variable, regressand or explained variable. On the right hand side a number of
variables that influence the dependent variable are identified (denoted by Xs). These are
referred to as independent variables, exogenous variables, explanatory variables or
regressors. If a single equation model consists of one independent variable, it is referred
to as two variable regression model. In case, the investigator identifies more than one
56
In order to estimate the econometric model, data on dependent and independent variables
are needed. As we have studied in the last block (Block-2) of this course, data can be
collected by way of experimental, sample survey or observational method. If you aim to
explain the variation of the dependent variable over a period of time, you will be required
to obtain observations at different points of time (referred to as time series data). The
periodicity may be annual, quarterly, monthly or weekly depending on the need and
requirement. If we want to analyse the characteristics of a dependent variable at a given
point of time, we need cross-section data, i.e. observations on a variable for different
units at the same point of time. In pooled data, we have time series observations for
various cross-sectional units; here we combine the element of time series with that of
cross-section.
After formulating the model and gathering the data, the next step is to estimate the
unknown parameters of the model, such as the intercept term α and the slope term β.
We shall discuss this issue in the next section.
We have already seen that the formulation of the basic model is guided by economic
theory, the investigator's perception of the underlying behaviour, and past experience or
studies. Consequently, the specified model is subjected to a variety of diagnostic tests to
ascertain whether the underlying assumptions and estimation methods are appropriate
for the data.
The final stage of the empirical investigation is to interpret the results. If the chosen
model does not refute the hypothesis or theory under consideration, we may use it to
predict the future value of the dependent variable Y on the basis of known or expected
future value of the explanatory variables.
(i) The first step is to specify the relationship between X and Y variable. Assuming
that there is a linear relationship between the two variables, we can specify the model as
Y = α + βX + U . By linearity, we often mean a relationship in which, the dependent
variable is a linear function of the independent variable. However, the linearity in the
context of regression analysis can be interpreted in two different ways: linearity in the
variable, and linearity in the parameters. Since the main purpose of regression analysis is
the estimation of its parameters, we shall consider only those models, which are linear in
parameter, no matter whether they are linear in variable or not. In fact, models that are
non-linear in variables but linear in parameters can be easily estimated by extending the
basic procedure that we are discussing here.
By the very nature of social science, the relationship among different variables cannot be
expected to be exact or deterministic. Hence the dependent variable Y tends to be
probabilistic or stochastic in nature. That is why we specify the regression model by
incorporating a random or stochastic variable. The random variable U is also called the
disturbance or error term and is a sort of catch-all variable that represents all kinds of
indeterminacies of an inexact relationship.
Y= α + β X + U (3.1)
The above equation is called the population regression function. In this formulation, Y is
a stochastic or random, but X is non-stochastic or deterministic in nature. This
asymmetry in the treatment of the dependent and the independent variable can be
removed by making both Y and X stochastic in nature. However, such a model is beyond
the scope of the present discussion. It should be clear that a random variable like U is
introduced in the population regression function to incorporate the element of
randomness of a statistical relationship.
Ŷ = α̂ + β̂X + Û    (3.2)

Here Ŷ, α̂, β̂ and Û are interpreted as the sample estimates of their corresponding
unknown population counterparts. Thus, we hypothesize that corresponding to the linear
population regression function Y = α + βX + U, there is a linear sample regression
function Ŷ = α̂ + β̂X + Û.
i) The disturbance term U has zero mean for all values of X, i.e. E(U) = 0.
ii) The variance of the disturbance term is constant for all values of X, i.e. V(U) = σ².
iii) The disturbance terms for two different values of X are independent, i.e.
Cov(Ui, Uj) = 0 for i ≠ j.
iv) X is non-stochastic.
v) The model is linear.
The philosophy behind the least squares method is that we should fit a straight line
through the scatter plot in such a manner that the vertical differences between the
observed values of Y and the corresponding values obtained from the straight line,
called errors, are as small as possible. In other words, we should choose α and β in
such a manner that the sum of the squares of the vertical differences between the actual
or observed values of Y and those obtained from the straight line is a minimum. The
straight line so obtained is called the line of best fit. Mathematically:
∧ ∧ ∧ ∧ ∧
Minimize ∑ U 2 = ∑(Y − Y ) 2 = E (Y − α − β X ) 2 with respect to α and β . By following
the usual minimization procedure, we obtain the so called normal equations. The two
normal equations are then simultaneously solved for
∧
β=
∑ xy (3.3)
∑x 2
and
∧
α = y − βˆ x (3.4)
The least squares estimators α̂ and β̂ are taken as the estimators of the unknown
population parameters α and β because they satisfy the following desirable properties:
(1) Least squares estimators are linear.
(2) Least squares estimators are unbiased, i.e. E(α̂) = α and E(β̂) = β.
(3) Among all the linear unbiased estimators, least squares estimators have the
minimum variance and are therefore termed efficient estimators.
All these properties together constitute the Gauss-Markov theorem: the least squares
estimators are the best linear unbiased estimators, i.e. BLUE.
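The least squares formulas above can be sketched numerically. The following Python fragment is our own illustration, not part of the original text; the function name and the data are hypothetical. It computes β̂ = ∑xy/∑x² and α̂ = Ȳ − β̂X̄ from deviations about the means:

```python
# A minimal sketch (hypothetical data) of the two-variable least squares formulas:
# beta_hat = sum(x*y)/sum(x^2), alpha_hat = Ybar - beta_hat*Xbar,
# where x and y are deviations of X and Y from their respective means.

def ols_two_variable(X, Y):
    n = len(X)
    x_bar = sum(X) / n
    y_bar = sum(Y) / n
    x = [xi - x_bar for xi in X]      # deviations of X from its mean
    y = [yi - y_bar for yi in Y]      # deviations of Y from its mean
    beta_hat = sum(a * b for a, b in zip(x, y)) / sum(a ** 2 for a in x)
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

# Data generated exactly from Y = 2 + 3X, so the estimates recover alpha = 2, beta = 3.
X = [1, 2, 3, 4, 5]
Y = [5, 8, 11, 14, 17]
alpha_hat, beta_hat = ols_two_variable(X, Y)
```

Because the data here lie exactly on a straight line, the residuals are zero and the estimates equal the true parameters; with real data the estimates would differ from the population values by sampling error.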
The standard errors of the estimates, also known as the standard deviations of the sampling
distributions of the least squares estimates, are taken as a measure of the precision of these
estimates. They are obtained by taking the positive square root of the variances
of α̂ and β̂. The expressions for both the variance and the standard error of the least
squares estimators are given below:
Var(α̂) = [∑X² / (n∑(X − X̄)²)] σ²    (3.5)

Se(α̂) = σ √[∑X² / (n∑(X − X̄)²)]    (3.6)

Var(β̂) = σ² / ∑(X − X̄)²    (3.7)

Se(β̂) = σ / √[∑(X − X̄)²]    (3.8)

Here an unbiased estimator of σ² is

σ̂² = ∑Û² / (n − 2)    (3.9)

   = ∑(Y − Ŷ)² / (n − 2)    (3.10)
3.4.2 Goodness of Fit

Once the regression line is fitted, we may be interested to know how faithfully the sample
regression line describes the unknown population regression line. The regression error
term or residual Û plays an important role in this regard. Small residuals imply that a
large proportion of the variation in the dependent variable has been explained by the
regression equation and hence the fit is good. Similarly, large residuals obviously point
to a poor fit. The coefficient of determination (the square of the correlation coefficient,
i.e. R²) acts as a measure of goodness of fit.
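As a numerical sketch (our own, not from the text; data hypothetical), R² for a line fitted with an intercept can equivalently be computed as one minus the ratio of the residual sum of squares to the total sum of squares:

```python
# Sketch: coefficient of determination for a fitted two-variable regression line,
# R^2 = 1 - RSS/TSS, where RSS = sum of squared residuals and
# TSS = total variation of Y about its mean. Data are hypothetical.

def r_squared(X, Y, alpha_hat, beta_hat):
    y_bar = sum(Y) / len(Y)
    rss = sum((yi - (alpha_hat + beta_hat * xi)) ** 2 for xi, yi in zip(X, Y))
    tss = sum((yi - y_bar) ** 2 for yi in Y)
    return 1 - rss / tss

X = [1, 2, 3, 4, 5]
Y = [5.1, 7.9, 11.0, 14.2, 16.8]
# least squares estimates from the usual deviation formulas
n = len(X)
xb, yb = sum(X) / n, sum(Y) / n
b = sum((x - xb) * (y - yb) for x, y in zip(X, Y)) / sum((x - xb) ** 2 for x in X)
a = yb - b * xb
R2 = r_squared(X, Y, a, b)   # close to 1: the line fits the scattered data well
```

A small amount of scatter has been built into Y, so R² falls just short of 1; with exactly linear data it would equal 1.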
Example: Given the following estimated regression model, interpret the results:

Ŷ = −14.0217 + 0.965217X,  R² = 0.989345    (3.11)
S.No.  Model            Equation                 Slope (dY/dX)     Elasticity ((dY/dX)(X/Y))
1.     Linear           Y = β1 + β2X             β2                β2(X/Y)
2.     Log-linear       ln Y = β1 + β2 ln X      β2(Y/X)           β2
3.     Log-lin          ln Y = β1 + β2X          β2Y               β2X
4.     Lin-log          Y = β1 + β2 ln X         β2(1/X)           β2(1/Y)
5.     Reciprocal       Y = β1 + β2(1/X)         −β2(1/X²)         −β2(1/(XY))
6.     Log reciprocal   ln Y = β1 − β2(1/X)      β2(Y/X²)          β2(1/X)
Choice of Functional Form

A great deal of skill and experience is required in choosing an appropriate model for
empirical estimation. However, the following guidelines can be helpful in this regard:
(i) The underlying theory may suggest a particular functional form.
(ii) Knowledge of the above formulas will be helpful in comparing the various
models.
(iii) The coefficients of the model chosen should satisfy certain a priori
expectations.
(iv) One should not overemphasize the r² measure in the sense that the higher the
r², the better the model. The theoretical underpinnings of the chosen
model, the signs of the estimated coefficients and their statistical
significance are of more importance in this regard.
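To illustrate the choice of functional form, the sketch below (our own, with hypothetical data) fits the log-lin form ln Y = β1 + β2X, whose slope β2 measures the proportional change in Y per unit of X; this is the natural form when a constant growth rate is to be estimated:

```python
import math

# Sketch: estimating a constant growth rate with the log-lin model
# ln Y = b1 + b2*X. The slope b2 is the relative change in Y per period,
# so the implied growth rate is exp(b2) - 1. Data are hypothetical.

def ols(X, Y):
    n = len(X)
    xb, yb = sum(X) / n, sum(Y) / n
    b2 = sum((x - xb) * (y - yb) for x, y in zip(X, Y)) / sum((x - xb) ** 2 for x in X)
    return yb - b2 * xb, b2

years = list(range(10))
income = [100 * (1.05 ** t) for t in years]     # series growing at 5% per period
b1, b2 = ols(years, [math.log(y) for y in income])
growth_rate = math.exp(b2) - 1                  # recovers the 5% growth rate
```

Taking logs turns the exponential series into an exactly linear one, so OLS on ln Y recovers the growth rate; with noisy real data the estimate would be approximate.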
Read Basic Econometrics (Fourth Edition) by Damodar N. Gujarati and Sangeeta, Tata
McGraw-Hill Publishing Company Ltd., Delhi (2007 edition), Chapters 3, 5 and 6
(pp. 60-108, 169-196).
The parameter of central interest in a regression model is the slope coefficient β.
Hypothesis testing consists of three basic steps:
(i) Formulating two opposing hypotheses:
H0: β = 0
H1: β ≠ 0
(ii) Deriving a test statistic and its statistical distribution under the null hypothesis,
conventionally denoted by t. Thus

t = (β̂ − E(β̂)) / s.e.(β̂)

The t statistic obtained above has n − 2 degrees of freedom because two
parameters (α and β) have been estimated from the sample.
(iii) Deriving a decision rule for rejecting or accepting the null hypothesis. The
following steps are involved in this process:
(a) H0: β = β0, H1: β ≠ β0
(b) The test statistic is t = (β̂ − β0) / s.e.(β̂) and can be calculated from the sample
information. Under the null hypothesis, it has the t distribution with n − 2
degrees of freedom. If the modulus of t is large, we would suspect that β
is probably not equal to β0.
(c) From the t table, trace the critical value of t for n − 2 d.f. at the desired level of
significance (say α).
(d) Reject H0 if |tc| > t* (tc = computed t value and t* = critical t value recorded
in the t table). If |tc| < t*, accept H0.
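The three steps above can be sketched in code. This is our own illustration with hypothetical data; the critical value is entered by hand from a t table (two-tailed 5% level, 6 d.f. for a sample of 8):

```python
import math

# Sketch of the t test on the slope: estimate beta_hat and its standard error,
# form t = (beta_hat - beta0)/se, and compare |t| with a tabled critical value.

def slope_t_test(X, Y, beta0):
    n = len(X)
    xb, yb = sum(X) / n, sum(Y) / n
    sxx = sum((x - xb) ** 2 for x in X)
    beta_hat = sum((x - xb) * (y - yb) for x, y in zip(X, Y)) / sxx
    alpha_hat = yb - beta_hat * xb
    rss = sum((y - alpha_hat - beta_hat * x) ** 2 for x, y in zip(X, Y))
    sigma2_hat = rss / (n - 2)            # unbiased estimator of sigma^2
    se_beta = math.sqrt(sigma2_hat / sxx)
    return beta_hat, (beta_hat - beta0) / se_beta

X = [1, 2, 3, 4, 5, 6, 7, 8]
Y = [2.1, 4.2, 5.9, 8.1, 9.8, 12.2, 13.9, 16.1]   # roughly Y = 2X
beta_hat, t = slope_t_test(X, Y, beta0=0.0)       # H0: beta = 0
t_critical = 2.447                                # from t table: 5%, 6 d.f., two-tailed
reject_H0 = abs(t) > t_critical
```

Since the data cluster tightly around a line of slope 2, the computed |t| far exceeds the critical value and H0: β = 0 is rejected.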
Example: Given the GDP at factor cost and the final consumption expenditure (FCE) for
the Indian economy during the period 1980-2001 at 1993-94 prices, we run the
regression of FCE on GDP and obtain the following results through SPSS software:

FCE = α + β GDP
t = (17.35968) (91.50314)
R² = 0.997617,  d.f. = 20
Null hypothesis H0: β = 0.80; alternative hypothesis H1: β ≠ 0.80.
The computed value of the t statistic (10.212841) exceeds the critical values of 2.845 and
2.086 for 20 degrees of freedom at the 1% and 5% levels of significance respectively.
Thus, the difference between the estimated value of β and its hypothesized value of 0.80
is so large that not even in 1 out of 100 cases (or 5 out of 100 cases) would we obtain
such a difference by chance. On the basis of the sample information, therefore, we are
not in a position to accept the null hypothesis. Hence, in all probability, India's marginal
propensity to consume during the sample period 1980-2001 has not been 0.80. In this
example, we considered a two-tailed test. Similarly, a one-tailed test can also be
conducted; it all depends upon the type of enquiry we intend to conduct.
The t test discussed above is an example of a small sample test. However, if the sample is
sufficiently large, then by virtue of the central limit theorem, the distribution of the test
statistic discussed above approximately follows the standard normal distribution.
Accordingly, the entire test can be conducted by consulting the standard normal table
instead of the t table, and in this case one need not bother about the degrees of freedom.
For deciding whether a sample is sufficiently large or not, one has to consider the
size of the sample (n). A rule of thumb is that if n is 30 or more, the sample can be
considered a large sample; otherwise, it is to be taken as a small sample.
We shall extend the regression analysis further to make it more realistic and
comprehensive by
(i) introducing one more explanatory variable and re-examining the model,
(ii) interpreting the partial regression coefficients,
(iii) considering how many explanatory variables should be included in the
model and what should be the touchstone for arriving at such a decision,
(iv) discussing the conditions or assumptions which make these extensions and
generalizations possible, and
(v) examining the possible effects of violations of one or more assumptions,
particularly multicollinearity, heteroscedasticity and autocorrelation.
(The following matter has been adapted from Unit 10, Block 3 of MEC-009 course)
For simplicity and better comprehension, we shall write model (3.1) in the form
Y = β0 + β1X1 + U so that more explanatory variables like X2, X3 can be added with
their respective coefficients.
Using subscript t with Y, X1, X2 and U to denote the tth observation of these variables, the
above equation can be written as
Yt = β 0 + β 1 X 1t + β 2 X 2t + U t
We collect the sample observations on Y, X1 and X2 and write down the sample
regression function as

Yt = b0 + b1X1t + b2X2t + et    (3.14)

where b0, b1 and b2 replace the corresponding population parameters β0, β1 and β2, and
the random population component Ut is replaced by the sample error term et. By applying
the principle of ordinary least squares, we work out the values of b0, b1 and b2 such that
the residual sum of squares (∑et²) is minimum. Here

et = Yt − b0 − b1X1t − b2X2t    (3.15)

∑et² = ∑[Yt − b0 − b1X1t − b2X2t]²    (3.16)
Differentiating (3.16) with respect to b0, b1, b2 and equating to zero gives us the three
normal equations:

∑Yt = n b0 + b1∑X1t + b2∑X2t    (3.17)

∑Yt X1t = b0∑X1t + b1∑X1t² + b2∑X1t X2t    (3.18)

∑Yt X2t = b0∑X2t + b1∑X1t X2t + b2∑X2t²    (3.19)
These three equations give us the following expressions for b0, b1 and b2 respectively:

b0 = Ȳ − b1X̄1 − b2X̄2    (3.20)

b1 = [(∑yt x1t)(∑x2t²) − (∑yt x2t)(∑x1t x2t)] / [(∑x1t²)(∑x2t²) − (∑x1t x2t)²]    (3.21)

b2 = [(∑yt x2t)(∑x1t²) − (∑yt x1t)(∑x1t x2t)] / [(∑x1t²)(∑x2t²) − (∑x1t x2t)²]    (3.22)

where yt = (Yt − Ȳ), x1t = (X1t − X̄1) and x2t = (X2t − X̄2) denote deviations from the
respective means.
SE(b0) = √Var(b0)

Var(b1) = [∑x2t² / ((∑x1t²)(∑x2t²) − (∑x1t x2t)²)] σ²    (3.24)

SE(b1) = √Var(b1)

Var(b2) = [∑x1t² / ((∑x1t²)(∑x2t²) − (∑x1t x2t)²)] σ²    (3.25)

SE(b2) = √Var(b2)

where σ² is unknown and its unbiased OLS estimator σ̂² is used in its place. σ̂² is worked
out as

σ̂² = ∑et² / (n − 3), where n − 3 stands for the degrees of freedom.    (3.26)
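The two-regressor estimation above can be sketched numerically. The fragment below is our own illustration (hypothetical data and function name); b1 follows formula (3.21), the expression for b2 follows by symmetry, and b0 follows formula (3.20):

```python
# Sketch: OLS for Y = b0 + b1*X1 + b2*X2 using the deviation formulas.
# Each slope is a ratio of cross-product sums of mean deviations.

def ols_two_regressors(Y, X1, X2):
    n = len(Y)
    yb, x1b, x2b = sum(Y) / n, sum(X1) / n, sum(X2) / n
    y  = [v - yb  for v in Y]
    x1 = [v - x1b for v in X1]
    x2 = [v - x2b for v in X2]
    s = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))   # cross-product sum
    den = s(x1, x1) * s(x2, x2) - s(x1, x2) ** 2
    b1 = (s(y, x1) * s(x2, x2) - s(y, x2) * s(x1, x2)) / den
    b2 = (s(y, x2) * s(x1, x1) - s(y, x1) * s(x1, x2)) / den
    b0 = yb - b1 * x1b - b2 * x2b
    return b0, b1, b2

# Data generated exactly from Y = 1 + 2*X1 + 3*X2, so OLS recovers (1, 2, 3).
X1 = [1, 2, 3, 4, 5, 6]
X2 = [2, 1, 4, 3, 6, 5]
Y  = [1 + 2 * a + 3 * b for a, b in zip(X1, X2)]
b0, b1, b2 = ols_two_regressors(Y, X1, X2)
```

Note that X1 and X2 must not be exactly collinear, otherwise the denominator `den` is zero and the slopes are not identified.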
Mathematically, b1 and b2 represent the partial slopes of the regression plane with respect
to X1 and X2 respectively. In other words, b1 shows the rate of change in Y as X1 alone
undergoes a unit change, keeping all other things constant. Similarly, b2 represents the
rate of change of Y as X2 alone changes by a unit while other things are held constant.
We have seen above that in the case of a single independent variable, r² measures the
goodness of fit of the fitted sample regression line. When we have two explanatory
variables X1 and X2, we might be interested in the proportion of the total variation in
Y (∑yt²) explained by X1 and X2 jointly. This information is conveyed by the multiple
coefficient of determination, denoted by R². It can be computed by the following formula:

R² = (b1∑yt x1t + b2∑yt x2t) / ∑yt²    (3.27)

R² lies between 0 and 1, and the closer it is to 1, the better the fit, which implies that the
estimated regression line is capable of explaining a greater proportion of the variation
in Y. The positive square root of R² is called the coefficient of multiple correlation.
where ESS = explained sum of squares and RSS = residual sum of squares.
It should be noted that every sum of squares has some degrees of freedom (df) associated
with it. Accordingly, in our 2-explanatory-variable case, ESS has 2 df and RSS has
(n − 3) df. In such a case, we find that (ESS/df)/(RSS/df) is the ratio of the variance
explained by X1 and X2 to the unexplained variance, and it follows the F distribution
with 2 and n − 3 degrees of freedom. In general, if a regression equation estimates 'k'
parameters including the intercept, then F has (k − 1) df in the numerator and (n − k) df
in the denominator.
F values can be expressed in terms of R² as under:

F = [R² / (k − 1)] / [(1 − R²) / (n − k)]    (3.29)
Interpretation: the larger the variance explained by the fitted regression line, the larger
the numerator will be in relation to the denominator. Thus, a large F value is evidence
against the null hypothesis H0: β1 = β2 = 0. Hence, when the F value exceeds the
relevant critical value, one cannot accept the hypothesis that the variables X1 and X2,
taken together, have no effect on Y.
Read Basic Econometrics (Fourth Edition) by Damodar N. Gujarati and Sangeeta,
Chapter 8, pp. 253-265.
As we add more and more explanatory variables (Xs), the explained sum of squares (ESS)
keeps rising and, consequently, R² goes on rising. However, each additional variable
that is added eats up one degree of freedom, and our definition of R² makes no allowance
for this loss. Thus, the strategy of improving the goodness of fit by simply increasing
the number of explanatory variables may not be justified. We know that TSS always has
(n − 1) degrees of freedom; therefore, comparing two regression models with the same
dependent variable but different numbers of independent variables will also not be
justified. Hence we must adjust our measure of goodness of fit for degrees of freedom.
This measure is called adjusted R², denoted by R̄². It can be derived from R² in the
following manner:
R̄² = 1 − (1 − R²)(n − 1)/(n − k)    (3.30)
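Formulas (3.29) and (3.30) are direct one-line computations once R², the number of observations n and the number of estimated parameters k are known. The sketch below is our own illustration with hypothetical figures:

```python
# Sketch: the F statistic in terms of R-squared, and the adjusted R-bar-squared.
# n = number of observations; k = estimated parameters including the intercept.

def f_statistic(r2, n, k):
    return (r2 / (k - 1)) / ((1 - r2) / (n - k))

def adjusted_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k)

# Example: R^2 = 0.90 from a regression with n = 23 observations and
# k = 3 parameters (intercept plus two slopes), so F has 2 and 20 d.f.
F = f_statistic(0.90, 23, 3)
r2_bar = adjusted_r2(0.90, 23, 3)
```

Note that R̄² is always below R² when k > 1, and adding a weak regressor can lower R̄² even though R² rises.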
In general, our regression model may have a large number of independent variables. Each
of those variables can, on a priori grounds, be expected to have some influence on the
'dependent' or 'explained' variable. Consider a very simple example. What can be
possible determinants of demand for potatoes in a vegetables market? One obvious
choice will be the price of potatoes. What else can affect the quantity demanded? Could it
be availability of vegetables which can be paired off with potatoes? In that case, prices of
a large number of vegetables which are cooked along with potatoes will become ‘relevant
explanatory variables’. You cannot ignore income of the community that patronizes the
particular market. Needless to say, the dietary preferences of members of the households
can also affect the demand, and so on. In the next part, we shall discuss techniques which
help us restrict the analysis to a selected few variables, though theoretical considerations
may find a huge number of them to be 'useful' and 'powerful' determinants. In fact, in
economic theory, we usually append the phrase ceteris paribus to many a statement.
This phrase means keeping all other things constant. That means we may focus on the
impact of only a few selected variables on the dependent variable while assuming that all
other variables remain 'unchanged' during the period of analysis. However, before taking
recourse to this assumption, we have to weigh the need to include more and more
variables in our model against the 'gains' in explanatory power of the model. We have
developed, in the previous section, a working touchstone for the inclusion of more
variables in terms of improvement in R̄², and have tried to give it a 'practical' shape in
the form of the magnitude of the 't' values of the relevant slope parameters.
With these considerations in mind, we can generalise the linear regression model as
follows:
We hypothesize that in the population, the dependent variable Y depends upon k
explanatory variables X1, X2, …, Xk. We also assume that the relationship is linear in
parameters. Three more assumptions are made, and they have a very significant bearing
on the analysis. These are:
a) absence of multicollinearity;
b) homoscedasticity of the error terms; and
c) absence of autocorrelation.
Y = Xβ + U    (3.32)

where

Y = [Y1, …, Yn]′,  β = [β1, …, βk]′,  U = [U1, …, Un]′

and X is the n × k matrix of observations on the explanatory variables:

X = | X11  X21  X31  …  Xk1 |
    | X12  X22  X32  …  Xk2 |
    |  ⋮                    |
    | X1n  X2n  X3n  …  Xkn |
We assume that:
(1) The expected values of the error terms are zero, that is, E(ui) = 0 for all i. In matrix
notation:

E(U) = [E(u1), …, E(un)]′ = [0, …, 0]′ = 0

(2) The error terms are not correlated with one another and they all have the same
variance σ² for all sets of values of the X variables. That is,

E(ui uj) = 0 for all i ≠ j, and E(ui²) = σ² for all i.
In matrix notation:

E[UU′] = | E(u1²)   E(u1u2)  …  E(u1un) |   | σ²  0   …  0  |
         | E(u1u2)  E(u2²)   …  E(u2un) | = | 0   σ²  …  0  | = σ² In
         |   ⋮                          |   |  ⋮            |
         | E(u1un)  E(u2un)  …  E(un²)  |   | 0   0   …  σ² |
(3) The X variables are non-stochastic, i.e. fixed in repeated sampling.
(4) The matrix X has full column rank equal to k. This means it has k linearly
independent columns. It implies that the number of observations exceeds the number of
coefficients to be estimated (n > k). It also implies that there is no exact linear
relationship among the X variables; this, in fact, is the assumption of the absence of
multicollinearity.
Note: The assumption E(ui uj) = 0 for i ≠ j means that the error terms are not correlated.
The implication of the diagonal terms in the matrix E(UU′) all being equal to σ² is that all
error terms have the same variance σ². This is also called the assumption of
homoscedasticity.
We can write the regression relation for the sample as

e = Y − Xb

and the residual sum of squares as φ = e′e = (Y − Xb)′(Y − Xb). By equating the
first-order partials of φ with respect to each bi (i = 1, …, k) to zero, we get k normal
equations. This set of equations in matrix form is:

∂φ/∂b = −2X′Y + 2X′Xb = 0    (3.34)
X′Xb = X′Y    (3.35)

When X has rank equal to k, the normal equations (3.35) have a unique solution and the
least squares estimator b is:

b = (X′X)⁻¹X′Y    (3.36)
To verify that b is an unbiased estimator of β, i.e. that E(b) = β, substitute Y = Xβ + U
into (3.36):

b = (X′X)⁻¹X′(Xβ + U)
  = (X′X)⁻¹X′Xβ + (X′X)⁻¹X′U
  = β + (X′X)⁻¹X′U

∴ E(b) = β + (X′X)⁻¹X′E(U) = β,

since X is non-stochastic and E(U) = 0.
Variance of b = σ 2 ( X ′X ) −1
Notes
1. In this course our objective is simply to introduce the concepts. You will find the
concepts treated at a much more rigorous level in the course on Basic Econometrics
(REC 003), included as a compulsory course of the M.Phil/Ph.D. Programme in
Economics.
2. The other ideas regarding the coefficient of determination R² and adjusted R² remain
the same as they were developed for the two-explanatory-variable case.
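The matrix computations b = (X′X)⁻¹X′Y and Var(b) = σ²(X′X)⁻¹ can be sketched with numpy. This is our own illustration (hypothetical data); the first column of X is a column of ones so that the first element of b acts as the intercept:

```python
import numpy as np

# Sketch: OLS in matrix form. Solving X'Xb = X'Y (the normal equations) is
# numerically preferable to forming the inverse explicitly for b itself.

def ols_matrix(X, Y):
    XtX = X.T @ X
    b = np.linalg.solve(XtX, X.T @ Y)     # solves X'Xb = X'Y
    n, k = X.shape
    e = Y - X @ b                         # residual vector
    sigma2_hat = (e @ e) / (n - k)        # unbiased estimate of sigma^2
    var_b = sigma2_hat * np.linalg.inv(XtX)
    return b, var_b

# Hypothetical data generated exactly from Y = 1 + 2*x2 + 3*x3.
x2 = np.array([1.0, 2, 3, 4, 5, 6])
x3 = np.array([2.0, 1, 4, 3, 6, 5])
X = np.column_stack([np.ones(6), x2, x3])
Y = 1 + 2 * x2 + 3 * x3
b, var_b = ols_matrix(X, Y)
```

With exact data the residuals, and hence the estimated variances, are zero; with real data var_b supplies the squared standard errors on its diagonal.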
Often the X variables may be found to have some linear relationships among
themselves. This vitiates our classical regression model.
Let us give specific names to the variables: say, X1 is the price of commodity Y and X2 is
family income. We expect β1 to be negative and β2 to be positive. Now we go one step
further. Let Y be the demand for milk and X1 the price of milk, and suppose the
family-wise demand for milk is being estimated for a family which also produces and
sells milk. Clearly, the larger the value of X1, the higher the magnitude of X2 will be.
In such situations, the estimation of the price and income coefficients will not be possible.
Recall that we wanted the X variables in our matrix equations to be linearly independent.
If that condition is not satisfied, the X′X matrix becomes singular, that is, its determinant
will be equal to zero. Thus, there will be no solution to the normal equations 3.34
(or 3.35). However, if collinearity is not perfect, we can still get OLS estimates and they
remain the best linear unbiased estimates (BLUE), though one or more partial regression
coefficients may turn out to be individually insignificant.
Not only this, the OLS estimates still retain the property of minimum variance. Further, it
is found that multicollinearity is essentially a sample problem: the X variables may not be
linearly related in the population, but some of our suppositions while drawing a sample
may create a situation of multiple linear relations in the sample.
The Classical Linear Regression Model has a significant underlying assumption that all
the error terms are identically distributed with zero mean and the same standard deviation
σ (or variance σ²). The second part of the assumption, that errors have a constant
standard deviation or variance, is known as the assumption of homoscedasticity.
What happens when this assumption of homoscedasticity does not hold? In symbolic
terms, E(ui²) = σi², i = 1, …, n; that is, the expectation of the squared errors is no longer
equal to a common σ²: each error term has its own variance, which varies from
observation to observation.
It has been observed that time series data usually do not suffer from this problem of
heteroscedasticity, but in cross-section data the problem may assume serious
dimensions.
The consequences of heteroscedasticity: if the assumption of homoscedasticity does not
hold, the OLS estimators remain linear and unbiased but are no longer BLUE, i.e. they
lose the minimum-variance property; moreover, the conventional formulas for their
standard errors become biased, so the usual t and F tests can be misleading.
In applied regression analysis, plotting the residual terms can give us important clues
about whether or not one or more assumptions underlying our regression model hold. The
pattern exhibited by ei² plotted against the values of the concerned variable can provide
such an indication.
Our ability to tackle the problem will depend upon the assumptions that we can make
about the error variance. Thus, the following situations may emerge:
i) When σi² is known:
Here the CLRM Yi = β0 + β1Xi + ui can be transformed by dividing each value by the
corresponding σi. Thus,

Yi/σi = β0(1/σi) + β1(Xi/σi) + ui/σi

This effectively transforms the error term to ui/σi, which can be shown to be
homoscedastic, and therefore the OLS estimators will be free of the disability caused by
heteroscedasticity. The estimates of β0 and β1 in this situation are called Weighted Least
Squares Estimators (WLSEs).
ii) When σi² is unknown, we make some further assumptions about the error variance:
(a) Error variance proportional to Xi. Here, the square-root transformation is enough:
we divide both sides by √Xi. Thus, our regression becomes

Yi/√Xi = β0(1/√Xi) + β1(Xi/√Xi) + ui/√Xi
       = β0(1/√Xi) + β1√Xi + νi,  where νi = ui/√Xi,

and this is sufficient to address the problem.
(b) Error variance proportional to Xi². Here we divide both sides by Xi:

Yi/Xi = β0(1/Xi) + β1 + ui/Xi
      = β1 + β0(1/Xi) + ηi

The error term will be ηi = ui/Xi, and this will be free of heteroscedasticity, thus
facilitating the use of OLS techniques.
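The division-by-Xi transformation can be sketched numerically. This is our own illustration with simulated data (all names and figures hypothetical): the error standard deviation is made to grow with X, the model is divided through by X, and plain OLS on the transformed variables recovers the parameters:

```python
import random

# Sketch: weighted least squares via transformation when the error variance
# grows with X^2. Dividing Y = b0 + b1*X + u by X gives
# Y/X = b1 + b0*(1/X) + u/X, whose error term u/X is homoscedastic.

def ols(X, Y):
    n = len(X)
    xb, yb = sum(X) / n, sum(Y) / n
    slope = sum((x - xb) * (y - yb) for x, y in zip(X, Y)) / sum((x - xb) ** 2 for x in X)
    return yb - slope * xb, slope

random.seed(0)
X = [float(x) for x in range(1, 51)]
# heteroscedastic errors: standard deviation proportional to X
Y = [5 + 2 * x + random.gauss(0, 0.3 * x) for x in X]

# regress Y/X on 1/X: the transformed intercept estimates the slope b1,
# and the transformed slope estimates the intercept b0
intercept_t, slope_t = ols([1 / x for x in X], [y / x for x, y in zip(X, Y)])
b1_wls = intercept_t
b0_wls = slope_t
```

Note the role reversal after transformation: the slope of the original model becomes the intercept of the transformed regression, and vice versa.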
iii) Respecification of the model: assigning a different functional form to the model,
instead of speculating about the nature of the error variance, may be found to be
expedient. For example, instead of the original model, we can estimate this model:

ln Yi = β0 + β1 ln Xi + ui

This log-linear model is usually adequate to address our concerns.
The classical regression model also assumes that the disturbance terms ui do not have any
serial correlation. But in many situations this assumption may not hold. The
consequences of the presence of serial or autocorrelation are similar to those of
heteroscedasticity: the OLS estimators are no longer BLUE. Symbolically, no
autocorrelation means E(ui uj) = 0 when i ≠ j.
Autocorrelation can arise in economic data on account of many factors. The cobweb
phenomenon, for instance, may give rise to the problem of autocorrelation in certain
types of economic time series (especially agricultural output and the like).
There are many tests for detecting autocorrelation, such as visual inspection of error
plots, the Runs Test and the Swed-Eisenhart critical runs test. But the test most
commonly used is the Durbin-Watson d test. This is defined as

d = [∑(t=2 to n) (et − et−1)²] / [∑(t=1 to n) et²]
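The d statistic above is a direct computation from the residual series. The sketch below (our own, with stylized residual series) shows the two extremes: d near 0 under strong positive autocorrelation and d near 4 under strong negative autocorrelation, with d near 2 indicating no autocorrelation:

```python
# Sketch: Durbin-Watson d statistic from a list of residuals.
# d = sum of squared successive differences / sum of squared residuals.

def durbin_watson(e):
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den

alternating = [1, -1, 1, -1, 1, -1, 1, -1]   # sign flips: negative autocorrelation
trending    = [1, 1, 1, 1, -1, -1, -1, -1]   # long runs: positive autocorrelation
d_neg = durbin_watson(alternating)           # well above 2
d_pos = durbin_watson(trending)              # well below 2
```

In practice the computed d is compared with the tabulated Durbin-Watson lower and upper bounds for the given n and number of regressors.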
However, again, we are holding back information on the practical detection and
avoidance of the problem of autocorrelation for reasons of limited space here.
1. State the various forms of regression models. When will you use a log linear
regression model? Give an illustration in support of your answer.
2. How do you interpret the estimated slope coefficient of a log linear regression
model?
3. If you want to estimate India’s rate of growth of per capita income during the
period 1990-2008, what should be the functional form of your regression
model?
4. How do you interpret coefficients of multiple regression model? Give an
example in support of your answer.
4.0 OBJECTIVES
The main objectives of this block are to:
• apprise you of the philosophical foundations and research perspectives guiding
qualitative research,
• explain the various principles governing the participatory method of the qualitative
approach,
• discuss the process and stages involved in the participatory method,
• apply the various tools and techniques of the PRA approach in research,
• appreciate the limitations and challenges faced in the participatory method, and
• explain the principles, research design and the steps involved in conducting
studies by applying the case study method.
4.1 INTRODUCTION
Research methodology deals with the branch of philosophy that analyses the principles
and procedures of scientific inquiry in a particular discipline, with a set of pedagogies for
understanding complex reality. Principles and procedures of scientific enquiry tend to
unfold the causality of factors to understand complex phenomena through empirical
evidence and its validation. Empirical evidence is captured through quantitative and
qualitative approaches and variables. The first three blocks of this course cover different
aspects of the quantitative approach, namely foundations of research methods, data
collection and analysis of data through quantitative methods. The quantitative approach
broadly deals with data and sampling errors, but the reliability of data may still suffer on
account of non-sampling errors, which the approach is not well equipped to handle. The
qualitative approach, on the other hand, is an in-depth scientific enquiry into complex
events, their dimensions and variables, which are difficult to capture through a cardinal
or quantitative approach. For example, it is easy to collect data on the income and
expenditure of households by canvassing simple structured questions within the
Keynesian framework of psychological laws, but much harder to capture the perceptions
and motivations underlying that behaviour. Moreover, traditional quantitative research
methods are considered time-consuming exercises whose results are often irrelevant for a
particular time-bound policy drive. These methods also involve the high cost of formal
surveys. Keeping in view all these limitations of the quantitative approach, this block
covers the qualitative approach to research methods, tools and techniques of data
collection, formatting, processing and analysis of data, report writing, etc. This block
focuses on a few important methods of the qualitative approach: participatory rural
appraisal (PRA) and the case study method (CSM).
truth or reality (ontology) and the theory of knowledge dealing with how we can
know the things that exist (epistemology).
• World view or framework that guides research and practice in field,
• General methodological prescriptions including instrumental techniques about
how to conduct work within the paradigm.
important than the similarities. Critical theory and interpretivism are the most
important paradigms in qualitative research. The peculiar features of these paradigms are:
• They differ on the question of reality.
• They offer different reasons or purposes for doing research.
• They point us to quite different types of data and methods as being valuable and
worthwhile.
• They have different ways of deriving meaning from the collected data.
• They vary in the relationship between research and practice.
The above three paradigms have been the dominant guiding frameworks in research in
the social sciences.
Differences between Post-positivism and Critical Theory on the Five Major Issues
• Nature of reality. Post-positivism: material and external to the human mind.
Critical theory: material and external to the human mind.
• Purpose of research. Post-positivism: find universals. Critical theory: uncover local
instances of universal power relationships and empower the oppressed.
• Acceptable methods and data. Post-positivism: scientific method, objective data.
Critical theory: subjective inquiry based on ideology and values; both quantitative and
qualitative data are acceptable.
• Meaning of data. Post-positivism: falsification, used to test theory. Critical theory:
interpreted through ideology, used to enlighten and emancipate.
• Relationship of research to practice. Post-positivism: separate activities, research
guides practice. Critical theory: integrated activities, research guides practice.
(Source: Foundations of Qualitative Research by Jerry W. Willis (2007), p. 83)
Differences between Post Positivism and Interpretivism on the Five Major Issues
________________________________________________________________________
Issue                          Post Positivism              Interpretivism
________________________________________________________________________
Nature of reality              External to the human mind   Socially constructed
Purpose of research            Find universals              Reflect understanding
Acceptable methods and data    Scientific method            Subjective and objective
                                                            research methods are
                                                            acceptable
Meaning of data                Falsification;               Understanding is contextual;
                               used to test theory          universals are deemphasized
Relationship of research to    Separate activities;         Integrated activities;
practice                       research guides practice     both guide and become the
                                                            other
________________________________________________________________________
(Source: Foundations of Qualitative Research by Jerry W. Willis (2007), p. 95)
Qualitative researchers have the option to choose conceptual frameworks from various
alternatives. A framework is a set of broad concepts that guide research. Researchers
working within the interpretive and critical theory paradigms have a number of frameworks
to choose from. There are many commonalities among these frameworks, but there are many
differences too. These differences often lead to the development and use of different
research methods.
The important frameworks that appeal to a number of researchers today include:
• Analytic Realism
• Interpretive perspective
• Eisner’s connoisseurship model of inquiry
• Semiotics
• Structuralism
• Post structuralism and Post modernism
All these frameworks put forward different options before a qualitative researcher and
point towards certain research methods, goals and topics. Under the positivist or post-
positivist paradigm, a researcher undertakes research by stressing the variables,
hypotheses and propositions derived from a particular theory that sees the world in terms
of cause and effect. On the other hand, under the interpretive paradigm, emphasis is
laid on socially constructed realities, inter-subjectivity, local generalisations, practical
reasoning and ordinary talk. Critical researchers underline the importance of terms like
action, structure, culture and power fitted into a general model of society. Under the
feminist perspective, research focuses on gender, reflexivity, emotion and an action
orientation.
All these frameworks have different applications in different disciplines of the social
sciences and point towards certain research methods, goals and topics. Two
characteristics – the search for contextual understanding rather than universal laws, and
the correspondingly different research design – make the qualitative approach distinct
from the post-positivist quantitative approach.
Read Chapter 5, “Frameworks for Qualitative Research”, in Foundations of Qualitative
Research by Jerry W. Willis (2007), Sage Publications, pp. 147-181.
4. The abductive strategy starts by laying out the concepts and meanings that are
contained in social actors’ accounts of activities related to a research
problem.
In short, RRA methods are more verbal, with outsiders more active, while PRA methods
are more visual, with local people more active. The methods of the two approaches are
broadly shared. Thus:
(i) The RRA approach is extractive-elicitive in nature, wherein data is collected by
outsiders.
(ii) PRA is a sharing-empowering approach in which the main objectives are variously
investigation, analysis, learning, planning, action, monitoring and evaluation
by insiders.
In practice, there is a continuum between RRA and PRA in the following manner:
Table 4.2: RRA and PRA continuum
________________________________________________________________________
Nature of process RRA PRA
________________________________________________________________________
Mode                      Extractive, elicitive     Sharing, empowering
Agroecosystem analysis was developed in Thailand from 1978 onwards and has combined
analysis of ecology and system properties with pattern analysis of space, time, flows and
relationships, relative values and decisions. It has contributed significantly to RRA and
PRA, particularly transects, informal mapping, diagramming, scoring and ranking.
Some of the major contributions of agroecosystem analysis to current RRA and PRA
have been:
- transects (systematic walks and observation);
- informal mapping (sketch maps drawn on site);
- diagramming (seasonal calendars, flow and causal diagrams, bar charts, Venn or chapati
diagrams);
- innovation assessment (scoring and ranking different actions).
(c) Applied anthropology
Applied anthropology came to the fore within social anthropology in the 1980s. Rapid
assessment procedures (RAP) and rapid ethnographic assessment (REA) were adopted in the
fields of health and nutrition. In these exercises conversation, observation, informal
interviews, focus groups, etc., were used for data collection. The ideas of field learning
and participant observation, the importance of attitudes, behaviour and rapport, and the
validity of local knowledge are its major contributions to RRA and PRA. Some of the many
insights and contributions coming from and shared with social anthropology have been:
- the idea of field learning as flexible art rather than rigid science;
- the value of field residence, unhurried participant- observation, and conversations;
- the importance of attitudes, behavior and rapport;
- the emic-etic distinction;
- the validity of indigenous technical knowledge.
(d) Field research on farming systems
Field research on farming systems is a multi-disciplinary approach to complex and
diversified problems, with systematized methods for investigating, understanding, and
prescribing for farming system complexity. In this method, farmers’ capabilities for
experimentation are recognized. Field research on farming systems has contributed to the
appreciation and understanding of:
-the complexity, diversity and risk-proneness of many farming systems;
-the knowledge, professionalism and rationality of small and poor farmers;
-their experimental mindset and behavior;
-their ability to conduct their own analyses.
4.4.3 Principles of PRA
The principles of PRA evolved in the course of experiments and from lessons drawn from
development tourism. The list of principles therefore varies from practitioner to
practitioner, as it has evolved over time. However, there are certain commonalities shared
by most of them.
(i) Principles shared by RRA and PRA:
(a) Reversal of learning: This is a departure from the dominant paradigm of
learning from formal institutions and from consolidated published
information. In this approach, face-to-face learning from the people takes
place on site, drawing on their local physical, technical and social
knowledge and analysis.
(b) Learning rapidly and progressively: This is an inbuilt, adaptable, rapid
learning process with conscious exploration, flexible use of methods,
improvisation, iteration and cross-checking, rather than strict adherence to
a blueprint programme.
A PRA activity involves a team of people working for two to three weeks on workshop
discussions, analyses, and fieldwork. Several organizational aspects should be
considered:
• Logistical arrangements should consider nearby accommodations, arrangements
for lunch for fieldwork days, sufficient vehicles, portable computers, funds to
purchase refreshments for community meetings during the PRA, and supplies
such as flip chart paper and markers.
• Training of team members may be required, particularly if the PRA has the
second objective of training in addition to data collection.
• PRA results are influenced by the length of time allowed to conduct the exercise,
scheduling and assignment of report writing, and critical analysis of all data,
conclusions, and recommendations.
• A PRA covering relatively few topics in a small area (perhaps two to four
communities) should take ten days to four weeks, but one with a wider scope over a
larger area can take several months. Allow five days for an introductory workshop
if training is involved.
• Reports are best written immediately after the fieldwork period, based on notes
from PRA team members. A preliminary field report should be available within a
week or so. The final report should be made available to all the participants and the
local institutions involved.
In order to ensure that people are not excluded from participation, these techniques
generally avoid writing wherever possible, relying on oral communication and on tools
like pictures, symbols, physical objects and group memory. Efforts are made in many
projects, however, to build a bridge to formal literacy; for example by teaching people
how to sign their names or recognize their signatures. Tools of data collection for RRA
and PRA are often overlapping and of supplementary nature but the basic difference lies
in the forms of ownership and end user.
Team contracts and interactions:
It is always better to decide collectively and unanimously, with permissible variation,
on certain norms and behaviour before the team proceeds for PRA. The team may agree on
how to interact on issues: the degree of distance or closeness, the mode of discussion
and its consolidation, the division of labour, etc.
Role reversal:
It is very important that, unlike in development tourism, the catalysts in PRA are always
in learning mode. They are merely facilitators. Therefore, dominant behaviour and the
arrogance of being the custodian of solutions and knowledge are not acceptable. A
deliberate effort to change behaviour and attitude is essential for a successful PRA.
Facilitators may initiate a discussion and leave the site, or sit back passively but
observe carefully. No manual is strictly followed; the group may decide collectively
whatever it finds effective.
Feedback session:
In order to review and consolidate progress, a feedback session is an essential element
of PRA.
Transect walks:
In this exercise, facilitators create an environment in which the surroundings are
discussed and learnt about with local people while walking through the village with
people of the area. This gives local people empowerment and an opportunity to get
involved in sharing their observations in discussions about features of the village: the
quality of its resources, infrastructure, technological levels in production, patterns of
use, production and productivity, lifestyles, customs and festivals, public events,
leadership qualities, problems of the locality, solutions, hurdles, etc. Facilitators have
to be patient listeners during transect walks, providing opportunities for local people to
present information in various forms: mapping, modelling, using symbols, etc. This
exercise thus empowers local people to consolidate information with collective wisdom
and to initiate data collection themselves.
Identification of key informants:
Identification of articulate experts of the area is an important task on which the success
and quality of a PRA depend. Key informants are identified through participatory social
mapping of a village.
Social Mapping:
It provides a basis for household listings, and for indicating population, social group,
health and other household characteristics. This can lead to the identification of key
informants and discussions with them. A village social map provides an updated
household listing to be used for well-being or wealth ranking of households. Based on
these lists, focus groups consisting of different categories of people are formed. These
groups express their different preferences, leading to discussion, negotiation and
reconciliation of priorities. Resource maps help in understanding the natural and
environmental settings of a particular village.
Contrast and comparison along various dimensions empower local people to understand
their problems from multiple angles.
Estimates and quantifications:
Collecting data in local units by local people is easier. Even those who cannot read or
write can count for data purposes using small pebbles, brick chips, seeds or sticks, or by
drawing lines and tallies on the ground or on walls. People manage their accounts through
their phenomenal memories of events. Combined with mapping, modelling and matrices,
excellent quantification can be achieved by local people.
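The kind of quantification described here, in which counts in local units (pebbles, seeds, tally marks) are combined through a matrix of options and criteria and totalled into a group ranking, can be sketched in code. This is purely illustrative; the crops, criteria and pebble counts below are hypothetical, not drawn from any actual PRA.

```python
# Hypothetical matrix-scoring tally: participants place pebbles against each
# option (here, crops) for each criterion; the totals give a group ranking.

def matrix_scores(pebbles):
    """Sum pebble counts per option across all criteria."""
    totals = {}
    for counts in pebbles.values():
        for option, n in counts.items():
            totals[option] = totals.get(option, 0) + n
    return totals

# Pebble counts per criterion (all figures invented for illustration).
pebbles = {
    "yield":        {"millet": 5, "maize": 8, "pulses": 3},
    "drought risk": {"millet": 9, "maize": 2, "pulses": 6},
    "market price": {"millet": 4, "maize": 7, "pulses": 8},
}

totals = matrix_scores(pebbles)
ranking = sorted(totals, key=totals.get, reverse=True)
print(totals)   # {'millet': 18, 'maize': 17, 'pulses': 17}
```

The same arithmetic works whatever the local counting unit is; what matters in PRA is that the tallying and the interpretation of the totals remain with the local people.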
Key probes:
Direct questions relating to the set objectives of the investigation lead to focused
discussion on non-controversial issues. Indirect narratives help in initiating discussion
on controversial issues.
Stories, portraits and case studies:
Narratives provide excellent insights, and local people are an inexhaustible treasure of
them. They can reflect, down memory lane, on events, on what was or was not resolved, and
on the outcomes. Case studies help in unfolding and correcting general perceptions.
Presentations and analysis:
It is always better to have the presentation made and cross-checked by local people. If
they need a little direction, the facilitator can help, but the participatory spirit of
the presentation must not be distorted. After the analysis, participatory action planning,
budgeting, implementation and monitoring need to be decided with a time frame.
Report writing:
Report writing is to be done without delay, in the field itself, collectively, dividing
assignments among the designated people involved in the process of learning through the
PRA sequence. Feedback from the groups and local people is always the correct approach
to validating understanding of the problems and solutions.
4.4.6 Sequence of Techniques
PRA techniques can be combined in a number of different orders and ways depending
upon the topic focus, goal and objectives under investigation. Some general rules of
thumb, however, may be useful. Rapport building is the core of the success of any PRA.
Mapping and modelling are good techniques to start with because they involve several
people, stimulate much discussion and enthusiasm, provide the PRA team with an
overview of the area, and deal with noncontroversial information. Maps and models may
lead to identification of key informants; transect walks, perhaps accompanied by some of
the people who have constructed the map; listing of the households. Wealth ranking is
best done later in a PRA, once a degree of rapport has been established, given the relative
sensitivity of this information; it may be followed by focus group discussion, matrix
scoring, and preference ranking. However, the sequence of techniques should be decided by
the groups through brainstorming. This exercise of group discussion may take place more
than once, depending upon the felt need. The group may decide at which stages to include
or exclude outsiders while conducting group discussions. The current situation can be
shown using maps and models, but subsequent seasonal and historical diagramming exercises
can reveal changes and trends, within a single year or over several years. Preference
ranking is a good ice breaker at the beginning of a group interview and helps focus the
discussion. Later, individual interviews can follow up on the different preferences among
the group members and the reasons for these differences.
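Preference ranking, mentioned above, aggregates individual orderings into a group ordering. A minimal sketch of one common aggregation rule, a Borda-style count, is given below; this is only an illustration of the arithmetic involved (in practice PRA groups often settle priorities through discussion rather than a formal rule), and the village problems and rankings are hypothetical.

```python
# Borda-style aggregation of individual preference rankings into a group
# ranking. All names and orderings below are invented for illustration.

def borda(rankings):
    """Score items: the top of an n-item ranking earns n-1 points, the last 0."""
    scores = {}
    for ranking in rankings:
        n = len(ranking)
        for place, item in enumerate(ranking):
            scores[item] = scores.get(item, 0) + (n - 1 - place)
    return scores

# Three participants each rank four village problems, most pressing first.
rankings = [
    ["water", "roads", "credit", "health"],
    ["credit", "water", "health", "roads"],
    ["water", "credit", "roads", "health"],
]
scores = borda(rankings)
group_order = sorted(scores, key=scores.get, reverse=True)
print(group_order)  # ['water', 'credit', 'roads', 'health']
```

Individual interviews can then probe why, for example, one participant placed credit above water, which is exactly the follow-up step the text describes.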
4.4.7 Practical Applications
This approach has been popular in the fields of natural resource management, agriculture,
implementation of rural development programmes, poverty eradication and social
development: health, education, food security, etc. It has now spread to the creation and
management of self-help groups, and to the marketing and commercial sectors as well.
Read: “The Origins and Practice of Participatory Rural Appraisal” by Robert Chambers,
World Development, Vol. 22, No. 7, pp. 953-963, 1994.
4.4.10 Challenges
The real challenge is to make it participatory in the true sense. Rural society is so
complex, and bound up in various strands of identity such as caste, religion, power groups
and class, that it is not easy to make a group truly participatory. Rapport building and
allowing space for the poor to participate with their experience and wisdom still have a
long way to go. However, Robert Chambers has considered and listed seven challenges,
given below:
(a) Beyond farming system research
(b) Participatory alternative to questionnaire surveys
(c) Issues of Empowerment and equity
(d) Local people as facilitators and trainers
(e) Policy research and change
(f) Personal behaviour, attitude and learning
(g) PRA in organisations
These approaches may have many shortcomings, but elements of creative learning,
empowerment and ownership of results have made them distinctly different from other
approaches.
Read: “Participatory Rural Appraisal (PRA): Challenges, Potentials and Paradigm” by
Robert Chambers, World Development, Vol. 22, No. 10, pp. 1437-1454, 1994.
When it is designed to test or illustrate a theoretical point, it will deal with the case as an
instance of a type, describing it in terms of a particular theoretical framework (implicit or
explicit). When it is exploratory or concerned with developing theoretical ideas, it is
likely to be more detailed and open-ended in character. The same is true when the
concern is with describing and/or explaining what is going on in a particular situation for
its own sake. When the interest is in some problem in the situation investigated, the
discussion will be geared to diagnosing that problem, identifying its sources, and perhaps
outlining what can be done about it. Variation in purpose may also inform the selection
of cases for investigation.
Read: “Philosophy of Social Research” (pp. 92-94), in The Sage Encyclopedia of Social
Science Research Methods, Vol. 1, Michael S. Lewis-Beck, Alan Bryman and Tim Futing
Liao (eds.), Sage Publications (2004).
4.5.1 Types of Case Studies
Broadly there are three types of case studies: exploratory, explanatory, and descriptive.
Each of these can be either a single or a multiple-case study, where the multiple cases
serve as replications, not as sampled cases.
(a) Exploratory: In this type of case study, fieldwork and data collection may be
undertaken prior to defining the research questions and hypotheses. In view of time
constraints, a willing and accessible case needs to be identified.
(b) Explanatory: This type of case is suitable for causal studies.
(c) Descriptive: In this type of case, the investigator begins with a descriptive theory,
or faces the possibility that problems will occur during the project.
Case studies have been widely used in education, law and medicine. Schools of business
have been the most aggressive in implementing case-based learning. The method has also
been used in the IT sector. Recently, cases of farmers’ suicides have been studied to
understand the agrarian crisis.
(b) Select the cases and determine data-gathering and analysis techniques
The researcher must determine whether to study cases which are unique in some
way or cases which are considered typical, and may also select cases to represent a
variety of geographic regions, a variety of size parameters, or other parameters. A
specific case is identified and, in multiple-case studies, each case is treated as a
single unit. The researcher must use the designated data-gathering tools
systematically and properly in collecting the evidence. Throughout the design
phase, researchers must ensure that the study is well constructed so as to ensure
construct validity, internal validity, external validity, and reliability.
3. What is the distinction between the PRA and RRA approaches of qualitative research?
Discuss the various methods and techniques of PRA with illustrations.
4. Develop a one or two page plan for a research study on topic of your choice
involving semi-structured interview as a major source of data.
5. Do you think that a researcher would make more progress using different
frameworks for different studies in the field? Give reasons.
6. How do the strategies of qualitative enquiry affect the method of data/material
collection?
7. Explain how the interpretivist philosophy of science is a significant departure from
post-positivism.
8. Frame a research proposal of your own choice specifying the purpose of research
to conduct the study from a critical theory perspective.
9. Frame a research proposal of your own choice specifying the purpose of research
to conduct the study from an interpretive perspective.
10. Make a distinction between case study method and experimental method. Explain
the different steps involved in case study method.
11. What are the main sources of participatory rural appraisal?
Structure
5.1 Introduction
5.2 Objectives
5.3 An Overview of the Theme
5.4 Macro Variable Data
5.4.1 The Indian Statistical System
5.4.2 National Income & Related Macroeconomic Aggregates
5.4.3 National Income & Levels of Living
5.4.4 Saving
5.4.5 Investment
5.5 Agricultural Data
5.5.1 Introduction
5.5.2 Agricultural Census
5.5.3 Studies on cost of Cultivation
5.5.4 Annual Estimates of Crop Production
5.5.5 Livestock Census
5.5.6 Data on Production of Major Livestock Products
5.5.7 Agricultural Statistics at a Glance (ASG)
5.5.8 Another Source of Data on Irrigation
5.5.9 Other Data on the Agricultural Sector
5.6 Industrial Data
5.6.1 Introduction
5.6.2 Data Sources Covering the Entire Industrial Sector
5.6.3 Factory (Registered) Sector – Annual Survey of Industries (ASI)
5.6.4 Monthly Production of Selected Industries and Index of Industrial Production (IIP)
5.6.5 Industrial Credit and Finance
5.6.6 Contribution to GDP
5.7 Trade
5.7.1 Introduction
5.7.2 Merchandise Trade
5.7.3 Services Trade
5.7.4 E-Commerce
5.8 Finance
5.8.1 Introduction
5.8.2 Public Finances
5.8.3 Currency, Coinage, Money and Banking
5.8.4 Financial Markets
5.9 Social Sectors
5.9.1 Introduction
5.9.2 Employment, Unemployment & Labour Force
5.9.3 Education
5.9.4 Health
5.9.5 Environment
5.9.6 Quality of Life
5.1 INTRODUCTION
We noted in Block 2 that statistical data constitutes an essential input to the research
process and talked about the methods and tools of data collection. One of the tools of data
collection, we noted, is to assemble secondary data or data already collected, compiled
and published by other agencies and make use of the same if these met the requirements
of the proposed research endeavour. A large number of Government agencies and several
non- Government agencies collect, compile, analyse and publish data on various aspects
of the Indian economy and society. Such data cover the performance of the economy in
different directions, socio-cultural trends and the impact of such performance on the
levels of living of different sections of society. Let us look in this Block at the kind
of data available, and at their quality, reliability and timeliness for the purposes for
which they are collected.
5.2 OBJECTIVES
We shall look at the database of the Indian economy in this Block. Section 5.4 starts with
a short description of the Indian statistical system and then moves on to discuss the kind
of data compiled and disseminated on the overall performance of the economy - macro
variables depicting it like the national income and state income, the national and regional
accounts, the input-output transaction table depicting inter-relationships between
economic activities, instruments facilitating growth like saving and investment and
finally, the real test of economic performance, namely, standards of living of the people.
The databases of important individual sectors of the economy are dealt with in the
subsequent sections. Section 5.5 discusses available data on the production of agricultural
Each Section/subsection ends with a box guiding the reader to relevant portions of one or
more publications that contain more details on the subject handled in it. Full details of
these publications are indicated in Section 5.12. Section 5.11 is to enable the reader to be
in touch with emerging developments relating to the review, refinement and expansion of
the database in different aspects/sectors of the economy. Section 5.13 is for evaluation of
the reader’s knowledge of the subject matter covered in this Block.
The Indian Statistical System generates data generally through large scale enquiries like
the Census or sample surveys of the kind conducted by the National Sample Survey
Organisation (NSSO), periodic statutory returns received by Government
Departments/organisations and as a by-product of administration at different levels.
Someone has to take the lead, in such a situation, to ensure adoption of appropriate
standards, concepts and definitions for the phenomena on which statistical data are
collected. The necessary institutional structures were created in India in the early Fifties
and strengthened over the years. Most recently, the National Statistical Commission
(NSC) made detailed recommendations to revamp the Indian statistical system to ensure
the quality, reliability and timeliness of data generated by the system.
evolve, monitor and enforce statistical priorities and standards and to ensure statistical
co-ordination among the different agencies involved. The Chief Statistician of India
functions as the Secretary to the NSC and also as the ex-officio Secretary, Ministry of
Statistics & Programme Implementation (MOSPI). The Statistics Wing of MOSPI,
functioning under the guidance of the NSC, is the nodal Ministry in the Government of
India for the integrated development of the statistical system in the country,
coordination of the work of statistical directorates/divisions in the Central Ministries
and the State Governments and for all policy matters relating to the Indian Statistical
Institute (ISI). It has under it
three organisations, the Central Statistical Organisation (CSO), the National Sample
Survey Organisation (NSSO) and the Computer Centre (CC). The State Directorate of
Economics & Statistics (SDES) is at the apex of the system at the State level, responsible
for coordination of statistical activities carried on by statistical cells/divisions/directorates
in different departments. SDESs have statistical offices in the districts and, in some cases,
also in the regions. CSO has revived the Conference of Central and State Statistical
Organisations (COCSSO). It is held annually to deliberate on matters relating to the
development of statistical data on aspects of the socio-economic life of the country. Agencies
concerned disseminate the data they collect, process and analyse to data users in print or
in electronic formats. CSO disseminates not only its own data but also those relating to
different sectors and aspects of the economy and society published by other Government
agencies. So do the Reserve Bank of India (RBI) and several non-Government sources.
SDESs provide a similar service in the States.
value addition computed for all sectors/ activities of the economy that is referred to as the
National Product and (ii) macro-aggregates related to it and (iii) trends in (i) and (ii), that
can help us in analysing the performance of an economy.
National Income (NI) is the Net National Product (NNP). It is also used to refer to the
group of macroeconomic aggregates like Gross National Product (GNP), Gross Domestic
Product (GDP) and Net Domestic Product (NDP). All these of course refer to the total
value (in the sense mentioned above) of the goods and services produced during a period
of time, the only differences between these aggregates being depreciation and /or net
factor income from abroad. There are other macroeconomic aggregates related to these
that are of importance in relation to an economy. What data would you, as a researcher or
an analyst, like to have about the health of an economy? Besides a measure of the
National Product every year or at smaller intervals of time, you would like to know how
fast it is growing over time. What are the shares of the national product that flow to
labour and other factors of production? How much of the national income goes to current
consumption, how much to saving and how much to building up the capital needed to
facilitate future economic growth? What is the role of the different sectors and economic
activities – in the public and private sectors or in the organised and unorganised activities
or the households in the processes that lead to economic growth? How does the level and
pattern of economic growth affect or benefit different sections of society? How much
money remains in the hands of the households for consumption and saving after they
have paid their taxes (Personal Disposable Income) – an important indicator of the
economic health of households? What is the contribution of different institutions to
saving? How is capital formation financed? Such a list of requirements of data for
analysing trends in the magnitude and quality of, and also the prospects of, efforts for
economic expansion being mounted by a nation can be very long. Such data, that is,
estimates of national income and related macroeconomic aggregates form part of a
system of National Accounts that gives a comprehensive view of the internal and external
transactions of an economy over a period, say, a financial year and the interrelationships
among the macroeconomic aggregates. National Accounts thus constitute an important
tool of analysis for judging the performance of an economy vis-à-vis the aims of
economic and development policy.
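The relationships among the aggregates mentioned above, which differ from one another only by depreciation and/or net factor income from abroad (NFIA), can be made concrete with a small sketch. The figures below are invented for illustration and are not actual Indian national accounts data.

```python
# Deriving the related aggregates from GDP, depreciation and net factor
# income from abroad (NFIA). All figures below are invented.

def aggregates(gdp, depreciation, nfia):
    gnp = gdp + nfia          # add factor income earned abroad, net
    nnp = gnp - depreciation  # NNP, which the text identifies with National Income
    ndp = gdp - depreciation
    return {"GNP": gnp, "NNP": nnp, "NDP": ndp}

# A hypothetical economy: GDP 100, depreciation 10, NFIA -2
# (i.e. it pays out more factor income abroad than it receives).
res = aggregates(gdp=100.0, depreciation=10.0, nfia=-2.0)
print(res)  # {'GNP': 98.0, 'NNP': 88.0, 'NDP': 90.0}
```

This makes explicit why the text says the only differences between the aggregates are depreciation and/or net factor income from abroad.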
The CSO of MOSPI compiles and publishes the National Accounts, which include estimates of
National Income and related macroeconomic aggregates like NNP, GNP, GDP & NDP,
consumption expenditure, saving, capital formation and so on for the country and for the
public sector for every financial year. Quarterly Estimates (Qtly.Es) of GDP are also
made. Estimates are prepared for any year at the prices prevailing in that year, that is,
estimates at current prices, and also at constant prices, that is, at the prices of a
selected year (called the base year). CSO changes the base year from time to time to take
into account structural changes in the economy and depict a true picture of the economy.
The base year from January 2006 is 1999-2000. Estimates of national accounts
aggregates are published in considerable detail in CSO’s Annual publication National
Accounts Statistics (NAS), the latest being NAS 2008. CSO releases through Press
Notes every January (on the 30th this year), Quick Estimates (QE) of GDP, National
Income, per capita National Income and Consumption Expenditure by broad economic
sectors for the financial year that ended in March of the preceding year (time lag - ten
months) and Revised Estimates (RE) of national accounts aggregates for earlier financial
years. Further, Advance Estimates (AEs) of GDP, GNP, NNP and per capita NNP at
factor cost for the current financial year are also released in February - two months
before the close of the financial year. (AEs for 2008-09 released on 9/2/2009.) These AEs
are revised thereafter and the updated AEs are released by the end of June, three months
after the close of the financial year. Meanwhile, by the end of March, Qtly.Es of GDP for
the quarter ending December of the preceding year are also released. Thus by the end of
every financial year (31st March), AEs for that financial year, QEs for the preceding
financial year and the Qtly.Es up to the quarter ending December of the financial year
become available. In fact, CSO sets before itself an advance release calendar for the
release of national accounts statistics over a period of two years, in line with SDDS
requirements.
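The distinction between current-price and constant-price estimates amounts to deflating nominal values by a price index anchored at the base year. A minimal sketch with invented figures (not actual CSO estimates):

```python
# Deflating nominal (current-price) GDP to constant base-year prices.
# All figures are invented for illustration.

def at_constant_prices(nominal_gdp, price_index, base_index=100.0):
    """Value at base-year prices = nominal value / (price index / base index)."""
    return nominal_gdp * base_index / price_index

# Nominal GDP has doubled since the base year, but prices are 60% higher,
# so real (constant-price) growth is much smaller than nominal growth.
real_gdp = at_constant_prices(nominal_gdp=200.0, price_index=160.0)
print(real_gdp)  # 125.0
```

This is why a change of base year, as described above, alters the constant-price series: the same nominal figures are re-expressed in the prices of a different reference year.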
NAS (NAS 2008) presents QEs of macroeconomic aggregates for 2006-07, AEs for
2007-08 and Qtly.Es of GDP for 1999-00 to 2007-08, summary statements of GNP,
NNP, GDP and NDP at factor cost at constant (1999-00) prices and market prices and
estimates of the components of GDP, aggregates like Government Final Consumption
Expenditure (GFCE), Private Final Consumption Expenditure (PFCE) in the domestic
market, Exports, Imports, the share of the public sector in GDP, industry wise GDP &
NNP, GDP at crop/item/category level and the consolidated accounts of the nation.
CSO’s estimates of NDP for rural and urban areas by economic activity at current prices
for 1970-71, 1980-81 and 1993-94 are published in NAS 2000. The list of publications of
the National Accounts Division (NAD) of CSO can be seen in the MOSPI website.
CSO’s Monthly Abstract of Statistics (MAS) and the annual Statistical Abstract of India (SAI), RBI’s Monthly Bulletin and Handbook of Statistics on the Indian Economy (2008) (also on the RBI website, http://www.rbi.org.in), the publication National Income Statistics of the Centre for Monitoring Indian Economy (CMIE) (Economic Intelligence Service – EIS), Mumbai, and the publication of the Economic and Political Weekly Research Foundation – EPWRF (EPWRF, December, 2004) – and its website www.epwrf.res.in also give time series estimates of national income and related macro aggregates.
The concepts and methodology used and the data sources utilised for making these
estimates are set out in two publications, namely, (CSO 2007) and (CSO, 1999a). The
methodology for (i) the New series of National Accounts Statistics with base year 1999-
2000 is given in the Brochure on New Series on NAS (Base Year 1999-2000), (ii) AEs
in: NAS 1994, (iii) estimates of factor incomes in NAS – Factor Incomes (March, 1994)
and (iv) Qtly.Es of GDP in a Note in NAS 1999. Besides, the publication NAS of every
year has a chapter “Notes on Methodology and Revision in the Estimates”. Sections 13.2
& 13.3, Chapter 13, of the NSC Report, (pp. 436 to 492) also contain methodological
and conceptual details and data sources utilised in estimating National Income and related
macroeconomic aggregates, data gaps and measures to overcome these. Changes in and adoption of improved methodology, expansion of the coverage of the estimates, changes in the base year, improvements in the quality of data and the use of new data sources over the last 50 years have all had a beneficial impact on the quality of national income estimates, that is, on estimating the “true values” of the aggregates as correctly as possible. These efforts can, however, affect the comparability of estimates over time, although CSO makes every effort to minimise the extent of non-comparability.
Read the chapter “Notes on Methodology and Revisions in the Estimates” in CSO
(2005), pp.220 – 228; and also the same chapter in CSO (2008).
Any economic activity is dependent on inputs from other economic activities for
generating its output and the output from this economic activity serves as inputs for
producing the output from other activities. Data relating to such interrelationships among
different sectors of the economy and among different economic activities are thus
important for analysing the behaviour of the economy and, therefore, for formulation of
development plans and setting targets of macro variables like output, investment and
employment. Such an input-output table will also be useful for analysing the impact of
changes in a sector of the economy or economic activity on other sectors of the economy
and indeed on the entire economy. CSO has published an Input-Output Transaction Table (I-OTT) every five years since 1968. The latest is the one relating to 2003-04. It gives,
besides the complete table, the methodology adopted, the database used, analysis of the
results and the supplementary tables derived from the I-OTT giving the input structure
and the commodity composition of the output. The Planning Commission updates and
recalibrates the I-OTT and prepares Input-Output Tables (I-OT) for the base and the
terminal years of a Five Year Plan and publishes the results of such an exercise as the
Technical Note to the Five Year Plan. (The latest is the one for the Tenth Plan.) It
contains the relevant I-OT, the methodology adopted and related material. The two I-OTs
are useful in economic and econometric analysis.
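The mechanics of input-output analysis described above can be illustrated with a toy example. The following Python sketch uses a made-up three-sector technical coefficient matrix (not figures from any actual I-OTT) to show how the Leontief inverse converts a final-demand vector into the total output each sector must produce:

```python
import numpy as np

# Hypothetical technical coefficient matrix A: A[i, j] is the value of input
# from sector i required per rupee of output of sector j.
A = np.array([
    [0.20, 0.10, 0.05],   # agriculture
    [0.15, 0.30, 0.20],   # industry
    [0.10, 0.15, 0.10],   # services
])

# Hypothetical final demand (consumption + investment + net exports), Rs crore.
f = np.array([100.0, 200.0, 150.0])

# Total output x must satisfy x = A @ x + f; hence x = (I - A)^{-1} @ f.
leontief_inverse = np.linalg.inv(np.eye(3) - A)
x = leontief_inverse @ f

# Column j of the Leontief inverse gives the rise in every sector's output
# per unit increase in final demand for sector j.
industry_column = leontief_inverse[:, 1]

print(np.round(x, 1))
```

The same calculation, applied with the coefficients of a published I-OTT, underlies the kind of impact analysis used in the Technical Notes to the Five Year Plans.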
(a) Estimates of State Domestic Product (SDP) Prepared and Released by State
Governments and Union Territory Administrations
State Accounts Statistics (SAS) consist of various accounts showing the flows of all
transactions between the economic agents constituting the State economy and their
stocks. The most important aggregate of SAS is the State Domestic Product (SDP) (State
Income). Estimates of GSDP and NSDP at constant and current prices are being prepared and published by all SDESs except those of Dadra & Nagar Haveli, Daman & Diu and Lakshadweep. These estimates are also available on the CSO website, in the publications mentioned in the preceding section and in EPWRF (June, 2003) and its CD-ROM.
[Read Section 13.7, Chapter 13, NSC Report, pp. 528 – 535 and Annexures 13.8 to 10]
The preparation of estimates of SDP calls for more detailed data than the preparation of national level estimates, especially on flows of goods and services and incomes across the geographical boundaries of States/Union Territories. Conceptually, estimates of SDP can be prepared by two approaches – the income originating approach and the income accruing approach. In the former case, the measurement relates to the income originating from the factors of production physically located within the area of a State; in other words, it is the net value of goods and services produced within the State. In the latter case, the
measurement relates to the income accruing to the normal residents of a State. The
income accruing approach provides a better measure of the welfare of the residents of the
State and also for preparing Human Development Indices (HDI), but it calls for data on
inter-State flow of goods and services and incomes, which are not available. Thus only
the income originating approach is used in preparing estimates of SDP. This has to be
kept in mind while using estimates of SDP. Although efforts have been made by the CSO
over the years to bring about a good degree of uniformity across States and Union
Territories in SDP concepts and methodology, SDP estimates of different States are not strictly comparable. The successive Finance Commissions therefore got comparable estimates of NSDP and per capita NSDP prepared by CSO for their work (available in the Reports of the successive Finance Commissions). EPWRF (June, 2003) also provides comparable estimates of SDP and compares these with those prepared for the Finance Commissions. The
question of comparability of estimates of SDPs is important for econometric work
involving inter-State or regional comparisons.
[Read 1. Sections 13.7.1 & 13.7.2, Chapter 13, pp. 528 – 535, NSC Report; 2. Preface and Chapters 5, 7, 8 & 10 in EPWRF (2003); 3. CSO (1974); 4. CSO (1976); 5. CSO (1979); 6. CSO (1980)]
The need for preparing estimates of district income has become urgent in the context of
decentralisation of governance and the importance of, and the emphasis on, decentralised
planning. Estimates of District Domestic Product (DDP) are being prepared by ten
SDESs using the income originating approach and published in State Statistical
Handbooks/Abstracts/Economic Surveys and also posted on their websites. (Another
State is preparing estimates only for commodity producing sectors.) It is necessary to
make adjustments in these estimates for flow of incomes across territories of districts (or
States) that are rich in resources like minerals and forest resources and where there is a
daily flow of commuters.
[Read 1. Subsection 13.7.7, Chapter 13, NSC Report, p. 532; 2. Paper on Methodology
for DDP in 1996 by SDESs Uttar Pradesh & Karnataka, CSO website; 3. Katyal, R.P.,
Sardana, M.G., Satyanarayana, J. (2001). ]
What do trends in macroeconomic aggregates say about the welfare of different sections
of society? Precious little, perhaps, especially when these are considered without
information on the distribution of these aggregates among these sections of society. Per-
capita national income or even per-capita personal disposable income can only indicate
overall (national) averages. Distribution of population by levels of income can be a big
step forward in understanding how well the performance in the growth of GDP has
translated into, or has not translated into, improvements in levels of living for sections of
society below levels considered the minimum desirable. It would also help us analyse trends in levels of inequality in living standards, employment and unemployment, the quality of employment, the health status of the people and the status of women.
Or, to consider all these together, what are the levels of human development and gender
discrimination? Such lines of analysis and the data required for the purpose are important
from the point of view of planning for a strategy of growth with equity.
The quinquennial Consumer Expenditure Surveys of the NSSO, the latest being the
61st Round (2004-05), provide the distribution of households by monthly per capita
consumption expenditure (MPCE) classes. Data on trends in the growth rate of
employment are available from the quinquennial employment and unemployment surveys
of the NSSO (the latest being the 61st Round). These and the GDP data enable us to look
at trends in employment elasticity. Comprehensive indicators like HDI and the Gender
Discrimination Index (GDI) have been prepared for the country and the States by the
Planning Commission (Human Development Report – HDR - 2001) and for individual
States and districts by several State Governments. These contain detailed data on
different facets of levels of living. All the reports are available in print and in electronic form on the websites of the organisations concerned.
Read 1. Chapter 1 (pp. 1 – 6) and Technical Appendix (pp.132 – 133), HDR 2001 of the
Planning Commission; 2. Sub-sections 9.8.6 to 9.8.21, Chapter 9, NSC Report, pp. 333
– 336.
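The employment elasticity mentioned above is simply the ratio of the growth rate of employment to the growth rate of GDP over the same period. The Python sketch below illustrates the calculation with hypothetical employment and GDP figures for two survey years five years apart (these are illustrative numbers, not actual NSSO or NAS estimates):

```python
# Hypothetical employment (millions) and GDP at constant prices
# (Rs '000 crore) for two survey years five years apart; illustrative only.
emp0, emp1 = 397.0, 457.0
gdp0, gdp1 = 1972.0, 2616.0
years = 5

# Compound annual growth rates over the period.
g_emp = (emp1 / emp0) ** (1.0 / years) - 1.0
g_gdp = (gdp1 / gdp0) ** (1.0 / years) - 1.0

# Employment elasticity: per cent growth in employment for each one per
# cent growth in GDP. A value well below 1 signals growth that creates
# relatively few jobs.
elasticity = g_emp / g_gdp
print(round(elasticity, 2))
```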
5.4.4 Saving
As you are aware, broadly speaking, GNP is made up of consumption, saving and exports net of imports, besides net factor income from abroad. Saving is important inasmuch as it goes to finance investment, which in turn brings about growth of GNP. What is the volume of Saving relative to GNP? How much of it is absorbed by depreciation? Who contributes, and how much, to the total volume of Saving? Let us see what kind of data are available on such questions.
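These concepts can be made concrete with a minimal Python sketch. The magnitudes below are invented for illustration (they are not NAS estimates); the sketch shows how Net Domestic Saving and the rates of saving follow from gross saving, depreciation and GDP:

```python
# Illustrative magnitudes in Rs crore at current prices; purely hypothetical.
gdp = 100_000.0
gross_domestic_saving = 32_000.0
consumption_of_fixed_capital = 9_000.0   # depreciation

# NDS strips out the part of gross saving absorbed by depreciation.
net_domestic_saving = gross_domestic_saving - consumption_of_fixed_capital

# Rates of saving are conventionally expressed as a percentage of GDP.
gross_saving_rate = 100.0 * gross_domestic_saving / gdp
net_saving_rate = 100.0 * net_domestic_saving / gdp

print(gross_saving_rate, net_saving_rate)  # 32.0 23.0
```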
Estimates of Gross Domestic Saving (GDS) and Net Domestic Saving (NDS) in current
prices and the Rate of Saving are made by CSO and published in the National Accounts
Statistics (NAS) and in the Press Note of January every year releasing the Quick Estimates. These are first made for any year along with the QEs of GDP, etc., and revised and finalised subsequently along with the revision of those QEs. The structure of Saving, that is, the distribution of GDS and NDS by type of institution – household sector, private corporate sector and public sector – is also available in NAS. Part III of
NAS 2008 also presents the time series of estimates of GDS and NDS in current prices
from 1950-51. Statistics on Saving are also published in the publications mentioned
under national accounts. Estimates of Gross and Net Domestic Saving at the State and
Union Territory levels are not being made at present by SDESs (as per the NSC Report).
Limitations that the estimates of Saving suffer from are indicated in the NSC Report. A high-level Committee on Savings under the Chairmanship of Dr. Rangarajan (2007) is making a critical review of the estimates of savings and investment in the economy.
[Read Sub-sections 13.6.1 to 13.6.6 (pp. 508 – 509) & 13.6.10 to 13.6.16 (pp. 511 – 528), NSC Report.]
5.4.5 Investment
The annual NAS documents of CSO present data on investment. NAS 2008 presents estimates of
Gross Domestic Capital Formation (GDCF), Gross Domestic Fixed Capital Formation
(GDFCF), Change in Stocks, Consumption of Fixed Capital (cfc), Net Domestic Fixed
Capital Formation (NDFCF) and Net Domestic Capital Formation (NDCF) in current
prices and at constant (1999-00) prices. These estimates are made along with QEs of
National Income every January (as in 2009) and the revision of these estimates proceeds
along with that of the estimates of national income aggregates. Thus NAS 2008 and the
MOSPI Press Note of 30/1/09 also present estimates of the distribution of these
aggregates at current and constant prices by type of institutions, by economic activity, the
manner in which CF is financed, external (current and capital) transactions and so on.
Publications referred to in the sub-section on GDP, etc., also present time series of such data, but the EPWRF publication in addition provides capital-output ratios and ratios of average net fixed capital stock (NFCS) to output, that is, average capital-output ratios (ACOR). CSO publications and the NSC Report contain the relevant methodological details. Estimates of GFCF at the State level are being prepared in 14 States. See also the EPWRF publication on NAS.
Gaps in data for the estimation of capital formation exist in a number of relevant areas, as
indicated in NSC Report.
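The capital-output ratios mentioned above are straightforward to compute once series of capital stock and output are in hand. The Python sketch below uses invented NFCS and GDP series (not EPWRF or NAS figures) to derive average and incremental capital-output ratios:

```python
# Hypothetical net fixed capital stock (NFCS) and GDP series, Rs crore at
# constant prices; illustrative values only.
nfcs = [400.0, 432.0, 468.0, 510.0]
gdp = [100.0, 106.0, 113.0, 121.0]

# Average capital-output ratio (ACOR): capital stock per unit of output.
acor = [k / y for k, y in zip(nfcs, gdp)]

# Incremental capital-output ratio (ICOR): additional capital required per
# unit of additional output between consecutive years.
icor = [(nfcs[t] - nfcs[t - 1]) / (gdp[t] - gdp[t - 1])
        for t in range(1, len(gdp))]

print([round(v, 2) for v in acor])   # [4.0, 4.08, 4.14, 4.21]
print([round(v, 2) for v in icor])   # [5.33, 5.14, 5.25]
```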
5.5.1 Introduction
You are aware of the importance of agriculture to the Indian economy and indeed to the
Indian way of life. You would, therefore, like to examine several aspects of agriculture
like the level of production of different crops and commodities, the availability and
utilisation of important inputs for agricultural production, incentives, availability of post-
harvest services and the role of agriculture in development. Similarly, you would like to
know about livestock and their products, fisheries and forestry, people engaged in these
activities and so on. All these analyses require an enormous amount of data over time and
space. Let us have a look at what kind of data are available and where.
The quinquennial Agricultural Census has been conducted since 1970-71; the seventh census related to 2001. The census collects data on holdings – their area, the gender and social group of the holder, irrigation status, tenancy particulars, the cropping pattern and the number of crops cultivated. The Input Survey, conducted in the following year, gathers data on the pattern of input use across crops, regions and size-groups of holdings, covering infrastructural facilities, chemical fertilizers, organic manures, pesticides, agricultural implements and machinery, livestock, agricultural credit and seeds. The results of the 2001 Agricultural Census and those of the Input Survey, 1996-97 are available on the census website at the national and State levels and also in ASG (2008). The results of the next census (2005-06) and Input Survey (2006-07) are awaited; those of the Input Survey 2001-02 are being finalised.
[Read 1. Sections 4.9 (pp.136 – 139), 4.14 (pp. 146 – 147) & 4.22 (159 – 161), NSC
Report; 2 http://www.agcensus.nic.in ; 3. Table Set 16, ASG (2008).]
DESMOA makes annual estimates of area, production and yield of principal crops of
foodgrains, oil seeds, sugar cane, fibres and important commercial and horticulture crops.
These crops account for about 87% of the total agricultural output. Estimates of production are built up from estimates of area and yield. While estimates of area are based on a reporting
system that is a mix of complete coverage and coverage by a sample, those of yield are
based on a system of crop cutting experiments and General Crop Estimation
Surveys. Advance estimates of crop production are also required even before the crops
are harvested for policy purposes. The first such assessment of the kharif crop is made in
the middle of September, the second – a second assessment of the kharif crop and the
first assessment of the rabi crop – in January, the third at the end of March or early April
and the fourth in June. Time series of final estimates of annual production3, gross area
under different crops and yield per hectare of these crops and Index Numbers on these
variables [base year the triennium ending (TE) 1993-94 = 100] are published in ASG. So
are estimates of production of crops by States.
[Read 1. pp.1 – 4 and Table Set 4, ASG 2008; 2. Sections 4.2 to 4.4, Chapter 4, NSC
Report, pp. 118 – 128.]
3 Crop area forecasts and final area estimates are now sample based, as suggested by the NSC.
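The index numbers of crop production published in ASG are weighted averages of crop-wise quantity relatives over a base period. The Python sketch below illustrates the arithmetic with invented production figures and weights (the actual ASG indices use the triennium ending 1993-94 as base, with official weights):

```python
# Hypothetical production (million tonnes) of three crop groups in the base
# period and the current year, with made-up base-period weights.
base_production = {"rice": 80.0, "wheat": 60.0, "coarse cereals": 30.0}
curr_production = {"rice": 92.0, "wheat": 72.0, "coarse cereals": 33.0}
weights         = {"rice": 45.0, "wheat": 35.0, "coarse cereals": 20.0}

# Weighted index of production: weighted average of crop-wise quantity
# relatives with base-period weights (a Laspeyres-type quantity index).
def production_index(curr, base, w):
    relatives = {c: curr[c] / base[c] * 100.0 for c in base}
    return sum(w[c] * relatives[c] for c in w) / sum(w.values())

index = production_index(curr_production, base_production, weights)
print(round(index, 2))  # 115.75
```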
The latest (17th) quinquennial livestock census for which results are available, conducted in October 2003, collected district-wise information on livestock, poultry, fishery and also agricultural implements. Livestock covers cattle, buffaloes, sheep, goats, pigs, horses and ponies, mules, donkeys, camels, yaks, mithuns and also dogs and rabbits, classified by age, sex, breed and function. Poultry covers cocks, hens, ducks and drakes, which are classified as desi and ‘improved’ varieties. Fishery covers fishing activity
(inland capture, inland culture, marine capture and marine culture), persons engaged in
fishing, craft/gear by type, size and horsepower, agricultural implements/equipment,
equipment for livestock and poultry and horticulture tools. The results are available on
the DAHDF website up to district level. ASG and BAHS also present some census data.
The fieldwork for the 18th Livestock Census was completed in October, 2007. Quick results of the Census, based on village/ward level data, are expected by March, 2009.
[Read Sections 4.13 & 4.14, Chapter 4, NSC Report, pp. 144 – 147; & www.dahd.nic.in]
AHSD is responsible for collection of statistics on animal husbandry, dairy and fisheries.
These are published in BAHS. The latest relates to 2006 and it presents data on production of milk, eggs, meat and wool, per capita availability of milk and eggs, contribution of
cows, buffaloes and goats to milk production and of fowls and ducks to egg production,
imports/exports of livestock/livestock products, area under fodder crops, pastures and
grazing, dry and green fodder production, artificial inseminations performed,
achievements in key components of dairy development, livestock and poultry. State wise
and time series data are presented in most cases.
[Read 1. Section 4.15, Chapter 4, NSC Report pp. 147 – 148; 2. www.dahd.nic.in ; 3.
Table Sets 19 &
The total geographical area of the country is made up of land and water bodies like rivers
and lakes. Land in turn consists of forests, barren and uncultivable land, land used for
non-agricultural purposes, pastures, fallows, cultivable land and so on. What is the
pattern of utilisation of land and how has this pattern been changing over time? How
much is used for agriculture? Land utilisation statistics are available in ASG. ASG also
provides information on size distribution of operational holdings, cropping intensity,
irrigation status, irrigation source, consumption of fertilizer and farmyard manure by size
classes of operational holdings and crops, soil conservation, utilisation of inputs and so
on.
operating costs of the Government Irrigation System over gross revenue is treated as the
imputed irrigation subsidy) and (iv) other subsidies given to marginal farmers and
Farmers’ Cooperative Societies in the form of seeds, development of oil seeds, pulses,
etc. ASG also presents the share of agricultural subsidies in selected OECD countries – in particular, a table that shows the amount of support to farmers, irrespective of the sectoral structure of a given country.
Other kinds of data on the agricultural sector presented in ASG are procurement of food
and non-food grains, Marketed Surplus Ratios of important agricultural commodities, per
capita availability of important articles of consumption, stocks of cereals, imports and
exports of agricultural commodities and so on.
Besides DESMOA, data on irrigation are collected by the Central Water Commission
(CWC) under the Ministry of Water Resources (MOWR). CWC collects
hydrological data on all the important river systems in the country through 877
hydrological observation sites. The Ministry conducts periodic Censuses of Minor
Irrigation Works along with a sample check to correct the Census data. The latest
Census related to 2000-01. The report, which can be seen in www.wrmin.nic.in,
provides information on minor irrigation works like the type of works, crop-wise
utilisation of the potential created and the manner of distribution. NSC has stressed the need for statistical analysis of the data available with CWC and the MOWR, for making users aware of the reasons for the variation between MOWR data and DESMOA data, and for reducing the time lag in both sets of data.
[Read Section 4.8, Chapter 4, NSC Report, pp. 134 – 136, & Annexure 4.7]
Data on forest cover are part of the land-use statistics presented on the basis of a nine-fold land-use classification in ASG. The Forest Survey of India (FSI) has also been collecting data on forest cover since 1987 through a biennial survey using Remote Sensing (RS) technology.
Digital interpretation has reduced the time lag in the availability of such data obtained
earlier through periodic reports from field formations. There are discrepancies between
ASG & FSI data on forest area due to differences in concepts and definitions. Data on
production of industrial wood, minor forest produce and fuel wood are available with the
Principal Chief Conservator of Forests in the Ministry of Environment & Forests.
The annual reports of National Bank for Agriculture and Rural Development
(NABARD) and its other publications like Statistical Statements Relating to the
Cooperative Movement in India and Key Statistics on Cooperative Banks, besides its
website and the RBI Handbook are useful sources of information on agricultural credit.
NAS provides data on the contribution of agriculture and its sub-sectors to GDP and
other measures of national/domestic product, value of output of various agricultural
crops, livestock products, forestry products, inland fish and marine fish and on capital
formation in agriculture and animal husbandry, forestry and logging and fishing.
[Read Sections 4.5 (pp. 129 – 130) and 4.17 (pp. 150 – 152), Chapter 4, NSC Report.]
5.6.1 Introduction
The industrial sector can be divided into a number of subgroups on the basis of
framework factors like coverage of certain laws, employment size of establishments or
criteria for promotional support by Government. Such groupings are the organised and
unorganised sectors, the factory sector (covered by the Factories Act, 1948), small-scale
industries, cottage industries, handicrafts, khadi and village industries (KVI), directory establishments (DE) (those employing six or more persons), non-directory establishments (NDE) (employing at least one person) and own account enterprises (OAE) (self-employed). Attempts have been made to get a detailed look at the characteristics of some of these sub-sectors of the industrial sector, as the data sources covering the whole sector often do not provide information in such detail.
individual subgroups and those that cover the entire industrial sector.
Three sources provide data on levels of industrial employment. The first is the decennial
Population Census (2001 is the latest) providing data, up to the district level, on levels
of employment (i) by economic activities and broad occupational divisions and (ii) by
economic sectors, age groups and education. The time lag in availability is large. The
second is the quinquennial sample surveys relating to employment and unemployment
conducted by the National Sample Surveys Organisation (NSSO), the latest being for
2004-05. These also provide similar type of data on industrial employment up to State
levels within a year or two. Estimates for the 72 NSS regions can also be derived from the unit record data available on floppies from NSSO. The third is the Employment Market
Information Programme (EMIP) of the Directorate General of Employment &
Training (DGE&T), Ministry of Labour & Employment and the State Directorates
of Employment (SDEs), based on statutory quarterly employment returns from non-agricultural establishments in the private sector employing 10 or more persons and from all public sector establishments (the organised sector). It provides data on employment in the organised sector at quarterly intervals down to district levels (Quarterly Reviews) in about a year’s time. Detailed data by economic activity are available in the Annual Employment Reviews of the DGE&T and SDEs after a large time lag.
Economic Census (EC): Conducted by CSO since 1977, the latest (the fifth) EC was in
2005. It covers all economic enterprises in the country except those engaged in crop
production and plantation and provides data on employment in these enterprises, besides
providing a frame for the conduct of more detailed follow up (enterprise) surveys (FuS)
covering different segments of the unorganised non-agricultural sector. EC gathers basic information on the number of enterprises and their employment by location, type of activity and nature of operation. The all-India Report for EC 2005 (accessible on the MOSPI website) and most of the State reports have been published.
The Annual Survey of Industries (ASI), launched in 1960, collects detailed industrial statistics relating to industrial units in the country – capital, output, input, value added, employment and factor shares – and has been conducted every year since 1960 except in 1972. The frame for the survey since 1998-99 consists of (i) all factories registered under Sections 2m(i) and 2m(ii) of the Factories Act, 1948, that is, those employing 10 or more workers and using power as well as those employing 20 or more workers without using power, and (ii) biri and cigar manufacturing establishments registered under the Biri and Cigar Workers (Conditions of Employment) Act, 1966, with coverage of units as in (i) above.
The reference period for the survey is the accounting year April to March preceding the
date of the survey. The sampling design and the schedules for the survey were revised in
1997-98, keeping in view the need to reduce the time lag in the availability of the results
of the survey. The survey does not attempt estimates at the district level. NIC 04 is used
for classifying economic activities from ASI 2004-05 onwards. The final results of ASI 2004-05 have been released. Results relating to selected characteristics at various levels of aggregation available in the ASI section of the MOSPI website are: (i) all industries by States, (ii) all
India by 2-digit level of NIC 04 with rural-urban break-up, (iii) all India by 2/3/4- digit
level of NIC 04, (iv) States by 2/3/4- digit level of NIC 04 and (v) Unit level data with
suppressed identification, etc. Data for the past surveys are also available on the website.
CSO has also released time series data on ASI in 5 parts, each volume covering parts
of the period 1959 to 1997-98, which present data on important characteristics for
all-India at two-digit and three-digit NIC code levels and for the States at two-digit
NIC code levels. These publications are also available in electronic media on payment.
EPWRF (April, 2002) also provides time series ASI data on the principal
characteristics of the factory sector along with the concepts and definitions used. These are also available on the EPWRF website and on interactive CD-ROMs.
The data available from ASI can be used to derive estimates of important technical ratios
like capital-output ratio, labour-output ratio, capital–labour ratio, labour cost per unit of
output, factor shares in net value added and productivity measures for different industries
as also trends in these parameters. The most important use of the detailed results arises
from the fact that these enable the derivation of estimates of (i) the input structure per unit of output at the individual industry level and (ii) the proportions of the output of each industry that are used as inputs in other industries, enabling us to use the technique of input-output analysis to evaluate the impact of a change effected in (say) the output of one industry on the rest of the economy. The construction of the I-OTT for the Indian economy is largely based on ASI data.
[Read 1. ASI section of MOSPI website; 2. EPWRF (April, 2002); 3. Section 5.1,
Chapter 5, NSC Report, pp.162 – 173.]
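The technical ratios derived from ASI characteristics are simple arithmetic once the aggregates are assembled. A minimal Python sketch, using invented figures for a single industry (not actual ASI results) and ignoring depreciation for simplicity:

```python
# Hypothetical ASI-style aggregates for one industry; illustrative only.
output = 8_000.0          # value of output, Rs crore
inputs = 5_600.0          # value of inputs, Rs crore
fixed_capital = 5_000.0   # Rs crore
workers = 40_000          # persons employed
wages = 640.0             # total emoluments, Rs crore

# Value added here is gross (depreciation ignored for simplicity).
value_added = output - inputs

capital_output_ratio = fixed_capital / output
capital_labour_ratio = fixed_capital / workers     # Rs crore per worker
labour_productivity = value_added / workers        # value added per worker
labour_cost_per_unit_output = wages / output
labour_share = wages / value_added                 # factor share of labour

print(round(capital_output_ratio, 3), round(labour_share, 3))  # 0.625 0.267
```

Computed industry-wise from successive ASI rounds, such ratios give the trends in technology and factor shares referred to above.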
CSO prepares and releases monthly indices of industrial production (IIP) and the
monthly use-based index of industrial production (base year 1993-94). The present
IIP with base year 1993-94 is a quantitative index based on production data received from
14 source agencies covering 543 items clubbed into 285 groups in the basket of items of
the index. The SDESs had been preparing IIPs for their respective areas but these were
not comparable with each other or with CSO’s national IIP because of differences in the
base year, basket of items, data and methodology used for constructing the indices. The
work of preparing State-wise IIPs comparable with the national IIP is at different stages
in different States and Union Territories. CSO releases Quick Estimates of IIP within
six weeks of the close of the reference month, in line with SDDS requirements. CSO has
released the IIP for December, 2007, the first revision of IIP for November 2007 and
the final IIP for September, 2007 through the Press Release of 12/2/09 (see MOSPI
website). NSC has made a number of recommendations to improve the quality of the IIP.
[Read MOSPI & EPWRF websites; Section 5.4, Chapter 5, NSC Report, pp. 187 – 200.]
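A quantitative production index of the IIP type is a fixed-weight average of item-group production relatives. The Python sketch below shows the arithmetic for three invented item groups with made-up weights and quantities (the actual IIP combines 285 item groups with official weights):

```python
# Each entry: (weight, base-year quantity, current-month quantity).
# Items, weights and quantities are invented for illustration.
items = {
    "cotton yarn": (30.0, 500.0, 560.0),
    "cement":      (45.0, 900.0, 990.0),
    "steel":       (25.0, 300.0, 285.0),
}

total_weight = sum(w for w, _, _ in items.values())

# Index = weighted mean of item-group production relatives (base year = 100).
iip = sum(w * (q1 / q0) * 100.0 for w, q0, q1 in items.values()) / total_weight
print(round(iip, 2))  # 106.85
```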
The Development Commissioner for Small Scale Industries (DCSSI) in the Central
Ministry of Small Scale Industries and the State Directorates of Industries provide data
on small-scale industrial units registered with the latter set of agencies. The DCSSI has
conducted a census of small scale industrial units thrice – the latest in November, 2002
(reference year 2001-02). The results of the third census are in the publication Final
Results: Third all India Census of SSI – 2001-02 of the Ministry of Small Scale
Industries. Broad details of the performance of small-scale industries are available in the
Annual Reports of the Ministry of Small Scale Industries. Time series data on
employment, production, labour productivity in small-scale industries (SSI) and value of
exports of the products of small-scale industry are also available in the RBI Handbook.
Data on some parts of the khadi and village industries (KVIC), handlooms and handicrafts sub-sectors do get included in ASI, but data relating exclusively to these sub-sectors are available in the Annual Reports of these organisations or of the Ministries under which these Boards/Commissions function.
How is industrial capital financed? Let us look, in the next sub-section, at some of the sources that throw light on this question.
[Read Sections 5.2 & 5.3, Chapter 5, NSC Report. Pp. 173 – 187.]
The RBI Handbook provides time series data on the sectoral deployment of non-food
gross bank credit provided by Scheduled Commercial Banks to different sectors of the
economy and also on the health of SSI and non-SSI units. The last category gives data on
sick and weak units for SSI and non-SSI sectors and the amounts outstanding (loans)
from each of these categories of units.
The ASI provides some data on financial aspects of industries – fixed capital, working
capital, invested capital, loans outstanding and also the interest burden of industrial units
(up to the 4-digit NIC code level). From where and how have the industries raised capital
needed by them? We have looked at one source of capital or working capital, namely, bank credit. Time series data on new capital issues, the kinds of shares/instruments issued (ordinary, preference or rights shares, debentures, etc.) and the composition of those contributing to capital (promoters, financial institutions, insurance companies, Government, underwriters and the public) are also presented in the RBI Handbook. Also available are data on assistance sanctioned and disbursed by financial institutions like the Industrial Development Bank of India (IDBI) and on the financing of the project costs of companies. The publication of the Securities and Exchange Board of India (SEBI) Handbook
of Statistics on the Indian Securities Market - 2008 provides annual and monthly time
series data on industry-wise classification of capital raised through the securities market.
A reference to the two volumes of CMIE (EIS), Industry: Financial Aggregates and
Industry: Market Size and Shares would be rewarding. Section 5.9 deals with data on
foreign direct investment (FDI), another source of capital finance.
The National Accounts Statistics (NAS) presents a short time series of estimates of (i)
value of output and GDP of each two-digit NIC code level industry in the registered and
the unregistered sub-sector of the manufacturing sector, (ii) value of output of major and
minor minerals and GDP and NDP of the mining & quarrying sector, and (iii) GDP and
NDP of the sub-sectors electricity, gas and water supply.
5.7 TRADE
5.7.1 Introduction
Trade is the means of building up an enduring relationship between countries and the
means available to any country for accessing goods and services not available locally for
various reasons like the lack of technical know-how. It is also the means of earning
foreign exchange through exports so that such foreign exchange could be utilised to
finance essential imports and to seek the much-needed technical know-how from outside
the country for the development of industrial and technical infrastructure to strengthen its
production capabilities. Trade pacts or agreements between countries or groups of
countries constitute one way of developing and expanding trade, as these provide easier
and tariff-free access to goods from member countries. While efforts towards such an
objective will be of help in expanding our trade, globalisation and the emergence of the
World Trade Organisation (WTO) have only sharpened the need to ensure efficiency in
the production of goods and services to compete in international markets to improve our
share of world merchandise trade and trade in services. Trade is also closely tied up with
our development objectives since trade deficit or surplus, made up of deficit/surplus in
merchandise trade and trade in services, contributes to current account deficit or surplus.
Data on trade in merchandise and services would enable us to appreciate the trends and
structure of trade and identify areas of strength and those with promise but need sustained
attention.
What would we like to know about foreign trade? The volume of trade, that is, the
volume of exports and imports, the size of export earnings, the expenditure on imports,
the size of exports relative to imports, and earnings from exports compared to expenses
incurred on imports, since exports earn foreign exchange while imports imply an outflow of
foreign exchange. We would also like to know about the trends in these variables. Besides
looking at the trends in the quantum and value of imports and exports, it is important to
analyse the growth in foreign trade both in terms of value and volume, since both are
subject to changes over time. Exports and imports are made up of a large number of
commodities and fluctuations in the export and imports of individual commodities
contribute to overall fluctuations in the volume and value of exports and imports. We,
therefore, need a composite indicator of the trends in trade. The index number of
foreign trade of a country is a useful indicator of the temporal fluctuations in exports and
imports of the country in terms of value, quantum and unit price and so on. Similarly,
measures of the terms of trade could be derived from such indices relating to imports and
exports. The existing index numbers have the base year 1978-79. The RBI Handbook 2008
publishes time series data (DGCI&S data) on the value (in US $ and Indian Rs.) of exports
and imports and the trade balance, the value of exports of selected commodities to principal
countries, the Direction of Foreign Trade (in US $ and Indian Rs.) by trade areas, groups of
countries and countries, year-wise Unit Value Indices (UVI) and Quantum Indices (QI)
for imports and exports and for each product, and the three terms of trade measures: Gross
Terms of Trade (GTT), Net Terms of Trade (NTT) and Income Terms of Trade (ITT).
RBI also generates data on merchandise imports and exports or trade data. The Balance
of Payments (BoP) data reported by RBI (published in RBI Handbook) show the value
of merchandise imports on the debit side and that of exports on the credit side and also
trade balance, all in the balance of payments format as part of the current account, which also
shows another entity, ‘invisibles’. However, there is a divergence between the trade
deficit/surplus in merchandise trade shown by DGCI&S data and that shown by RBI’s BoP
data. This discrepancy also affects data on the current account deficit (CAD) or surplus,
since the current account balance is the total of the trade deficit/surplus and net invisibles
(see later for ‘invisibles’). There are three reasons for the divergence between the two
sources. First, DGCI&S tracks physical imports and exports while BoP data track payment
transactions relating to merchandise trade. Second, DGCI&S data fail to capture
Government imports, which are exempted from customs duty (e.g. Defence imports).
Finally, DGCI&S data do not capture imports that do not cross the customs boundary
(e.g. oil rigs and some aircraft) while they are still paid for and get captured in BoP data.
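The accounting relationship involved here (the current account balance as the sum of the
trade balance and net invisibles) can be sketched with hypothetical figures:

```python
# Hypothetical BoP-style figures in US$ billion; not actual RBI data.
merchandise_exports = 160.0
merchandise_imports = 250.0
net_invisibles = 75.0  # net receipts from services, transfers and income

trade_balance = merchandise_exports - merchandise_imports  # negative: trade deficit
current_account_balance = trade_balance + net_invisibles   # negative: CAD

# A divergence in the measured trade balance (DGCI&S vs BoP) therefore
# carries through, one for one, into the measured current account balance.
```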
The BoP table also shows (i) ‘net’ figures against each of the items and (ii) separate
figures for “software services” under the category “miscellaneous”.
5.7.4 E-Commerce
[Read 1. Notes on/Footnotes to Tables on Foreign Trade and BOP, RBI Handbook
(2008) and RBI Monthly Bulletin Feb., 2009; 2. Section 10.9 pp. (383 – 391) & 10.12
(pp. 396 – 398), Chapter 10, NSC Report.].
5.8 FINANCE
5.8.1 Introduction
While trade and finance have been closely bound up with each other ever since the time
money replaced barter as the means of exchange, finance is the lifeline of all activities. It
flows from the public as taxes to Government, as savings to banking and financial
institutions and as share capital or bonds or debentures to the entrepreneur. It then gets
used for a variety of development and non-development activities through Government
and other agencies and flows back to members of the public as incomes in various ways,
as factor incomes. It would, therefore, be of interest to know how funds get mobilised for
various purposes and how they get used. This section looks at the kind of data available
that could enable us to analyse this mobilisation process and the flows of funds to
different areas of activity.
The finance sector consists of public finances, the central bank (the RBI), the scheduled
banks, urban and rural cooperative banks and related institutions. The financial market
consists of the stock exchanges dealing with scrips like shares, bonds and other debt
instruments, the primary and secondary markets, the foreign exchange market, the
treasury bills market and the securities market where financial institutions, mutual funds,
foreign institutional investors, market intermediaries, the market regulator (the Securities
and Exchange Board of India, SEBI), the banking sector and the RBI all play important roles.
There is also the unorganised sector made up of financial operators like money-lenders
and pawn brokers. Insurance is another area of finance.
What would we like to know about public finances? We would like to know how they are
managed. What are the sources of such finances and how and on what are they spent?
Does the Government restrict its expenditure within its means or does it spend beyond the
resources available to it? Does it, in the process, borrow heavily to finance its
expenditure? The Budget documents of the Central and State Governments, the pre-
Budget Economic Survey and the publication Indian Public Finance Statistics of the
Ministry of Finance, the Planning Commission’s Five Year Plan Documents and RBI
Handbook 2008 and the RBI Monthly Bulletins and EPWRF website provide a variety
of data on public finances.
The Economic Survey, for instance, gives an overall summary of the budgetary
transactions of the Central and State governments and Union Territory Administrations.
This includes the internal and extra-budgetary resources of the public sector undertakings
for their plans and indicates the total outlay, the current revenues, the gap between the
two, the manner in which the gap is financed by net internal and external capital receipts
and finally, the overall budgetary deficit. It gives the break-up of the outlay into
developmental and non-developmental outlays and the components of these and those of
current revenues. The RBI Handbook 2008 presents time series data in respect of public
finances in four groups – (i) Central Govt. Finances, (ii) Finances of the State Govts.,
(iii) Combined Finances of Central and State Govts., and (iv) Transactions with the
Rest of the World. Besides covering data areas like Govt. receipts and expenditure, the
first two groups cover key deficit indicators, the financing of Gross Fiscal Deficit (GFD)
and outstanding liabilities. The third group covers in addition the range and weighted
averages of Central and State Govt. dated securities and the shares of categories of
holders of Central and State Govt. Securities.
We have looked at one area of the fourth group, namely, trade in merchandise and
services in the section on Trade. There are other areas in which India interacts with the
rest of the world. Foreign exchange flows into the country as a result of exports from
India, external assistance/aid/loans/borrowings, returns from Indian investments abroad,
remittances and deposits from NRIs and foreign investment (FDI and portfolio investment)
in India. Foreign exchange reserves are used up for purposes like financing imports,
retiring foreign debts and investment abroad. What is the net result of these transactions
on the foreign exchange reserves? What are the trends in these flows and their
components? What is the size of the current account imbalance relative to GDP and its
composition? If it is a deficit, is it investment-driven? What is the size of foreign
exchange reserves relative to macro-aggregates like GDP, the size of imports and the size
of the short term external debt? While the Weekly Statistical Supplement and the RBI
Monthly Bulletin give data on forex reserves and related data, the RBI Handbook gives
time series data (in US$ and in Indian Rupees) on these parameters. As for FDI, the
data compiled by the RBI and the Department of Industrial Policy & Promotion (DIPP)
in the Ministry of Commerce & Industry have been in accordance with international best
practices since 2000-01. The RBI Handbook 2008, the SEBI Handbook 2008, the
RBI Monthly Bulletin, Feb. 2009 and the websites www.rbi.org.in and www.dipp.nic.in
provide time series data on FDI.
Economic transactions need a medium of exchange. We have come a long way from the
days of barter and come to the use of money and equivalent financial instruments as the
medium of exchange. Banks function as important financial intermediaries not only in
this process but also in matters of resource mobilisation and the deployment of such
resources. The central bank of the country (RBI in India) regulates the functioning of the
banking system. In addition, it issues currency notes and takes steps to regulate the
money supply in the economy so as to achieve simultaneously the objectives of ensuring
adequate credit to development activities and maintaining stability in prices. We should,
therefore, be interested in data on money supply or the stock of money and its structure
and the factors that bring about changes in these, the kind of aggregates that need
monitoring, the transactions in the banking system in pursuance of the nation’s
development objectives, the flow of credit to different activities, indicators of the health
and efficiency of banks which are the custodians of the savings of the public. We should
also be interested in data on prices, as price level affects the purchasing power of money
and indices of prices appropriate for the purpose/group in question – prices for producers
and consumer prices for different groups of consumers.
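The price indices just mentioned are weighted index numbers. A minimal sketch of a
Laspeyres-type index, which values the base-period basket at current prices, with purely
illustrative prices and quantities:

```python
# Laspeyres price index: cost of the base-period basket at current prices,
# relative to its cost at base-period prices, scaled to 100.
def laspeyres_index(base_prices, current_prices, base_quantities):
    cost_now = sum(p * q for p, q in zip(current_prices, base_quantities))
    cost_base = sum(p * q for p, q in zip(base_prices, base_quantities))
    return 100.0 * cost_now / cost_base

# Three illustrative commodities
wpi_like = laspeyres_index(base_prices=[10.0, 20.0, 5.0],
                           current_prices=[12.0, 22.0, 6.0],
                           base_quantities=[3.0, 1.0, 10.0])
# A value above 100 indicates the basket costs more than in the base year.
```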
Most of these data are compiled by the RBI on the basis of its records and those of
NABARD and returns that it receives from banks and can be found in RBI Bulletins and
RBI Handbook. These are also published in the Monthly Abstract of Statistics of
CSO. The Wholesale Price Index (WPI) (base year 1993-94) is compiled by the
Economic Adviser’s Office in the Ministry of Industry; the Consumer Price Index for
Industrial Workers (CPI-IW) (base year 2000) and the CPI for Agricultural Labour
(CPI-AL) (base year 1986-87) by the Labour Bureau, Shimla; and the CPI for Urban
Non-Manual Employees (CPI-UNME) (base year 1984-85) by the CSO. All are published by the agencies
concerned (also posted on their websites) and are available in the RBI and CSO
publications mentioned above. The Monthly Abstract of Statistics also gives monthly
data on average rural retail prices of (i) selected commodities/services and (ii)
controlled/rationed items collected by NSSO. Two other reports of the Reserve Bank of
India published every year – the Report on Currency and Finance and the Report on
Trends in Banking provide a wealth of information of use to analysts. EPWRF website
provides data on Banking, Money and Finance.
What would we like to know about financial markets and their functioning? We would
like to know about the ways in which financial resources can be accessed and at what
cost. What are the prevailing interest rates payable for funds to meet short-term or long-
term requirements? How do new ventures access the large amounts of resources that they
need? How do term lending institutions access the funds required for
their operations? What are the sources of funds?
RBI, which regulates banking operations and the operations of NBFCs and FIs, SEBI,
which regulates the capital market and the Department of Company Affairs, which
administers the Companies Act, are the major sources of data on financial markets. The
RBI Handbook 2008 (also RBI website) and SEBI’s Handbook of Statistics on the
Indian Securities Market 2008 (also, www.sebi.gov.in) contain comprehensive data on
financial markets. The two together provide time series data on several aspects of
financial markets. Examples are the structure of interest rates, resource mobilisation in
the Private Placement Market, net resources mobilised by mutual funds (MFs), new
capital issues by non-govt. public ltd. companies, absorption of private capital issues –
the no. of issuing companies, the number of shares and amount subscribed by various
categories of subscribers, annual averages of share price indices, resources raised by the
corporate sector through equity/debt issues, the share of private placement in total debt
and total resource mobilisation, the share of debt in total resource mobilisation; pattern of
funding for non-govt. non-financial public limited companies, capital raised by economic
activity/size of capital raised/region, trends on trading on stock exchanges, indicators of
liquidity - market capitalisation-GDP ratio (BSE & NSE), turnover ratio (BSE) and
traded value ratio (BSE & NSE) and comparative evaluation of indices (BSE SENSEX
etc.) through Price to Earnings Ratio and Price to Book Ratio.
[Read Sections 10.1 to 10.11, Chapter 10, NSC Report, pp.337 – 396; Article “New
Monetary and Liquidity Aggregates”, RBI Bulletin of November, 2000.]
5.9 SOCIAL SECTOR
5.9.1 Introduction
Social Sector consists of education, health, employment, environment and levels of living
or quality of life in general. Investments in this sector pay rich dividends in terms of
rising productivity, distributed growth, reduction in social and economic inequalities and
levels of poverty, though after a relatively longer time span than in the case of investment
in physical sectors. Let us look at the kind of data available in this sector.
5.9.2 Employment
Employment is the means to participate in the development process and also benefit from
it. Creation of employment opportunities is an important instrument for tackling poverty
and to empower people, especially women. We should, therefore, know how many are
employed and how many are ready to work but are unable to gain access to employment
opportunities. How do women fare in these matters? Or, for that matter, what is the
experience of men and women belonging to different social/religious/disadvantaged
groups? Are children employed in any economic activity that is not only hazardous to
their health but also adversely impacts our dream of a golden future for them, a dream
built on efforts to ensure their mental and physical well-being? What is the quality of
employment opportunities available to the work force? What are the conditions in which
people work?
Data on employment in selected sectors are available from several sources. We have
already looked at EMIP of DGE&T. This is the only source of employment data on the
organised sector of the economy that is available at quarterly intervals but is subject to
the limitations arising out of non-response in the submission of returns and
incompleteness of the employers’ register (the frame). EMIP also produces biennial
reports on the occupational and educational pattern of employment in the public and
private sectors, based on another return, but these have lost their utility over the years
due to a high level of non-response.
The DGE&T and the SDEs also provide data on the number of jobseekers by age, sex
and educational qualifications and type of social/physical handicap on the live register of
employment exchanges. Not all registered jobseekers are unemployed, and not all the
unemployed register with the employment exchanges, registration being voluntary.
Some register at more than one exchange. The size of the live register cannot, therefore,
be an accurate estimate of the level of unemployment. It does, however, represent the extent of pressure in
the job market, especially for Govt. and public sector jobs.
The second and third sources, EC and ASI, have also been discussed earlier. The quality
of ASI data is tied to the completeness of the frame of factories, which, in turn depends
on the quality of the enforcement of the two relevant Acts. Data on employment in the
Railways and the Banking sector are available from Railway Board and RBI
respectively. The Indian Labour Statistics (ILS) (the latest is for 2006) published by
the Labour Bureau (LB), Ministry of Labour & Employment, Shimla presents data on
employment in a number of sectors that are covered by different labour legislations, but
all these suffer from inadequacy of response in the submission of statutory returns and
partial coverage of the relevant Act.
The latest NSSO quinquennial EUS relates to 2004-05 (61st Round). Its results are
published in NSSO reports numbered 515 to 521. Report no. 515, Parts I & II,
“Employment & Unemployment Situation in India”, is the overall report. These
provide data on employment and unemployment in greater detail than the Census, up to
the State level. Analysis at the level of the 72 NSS regions is possible with the unit record
data obtainable from NSSO. EUSs provide data on the distribution of the
employed/unemployed by a number of characteristics like sex, rural/urban residence, age,
education, employment status, economic activity (of the employed) and the monthly per
capita expenditure class (MPCE) of the household concerned and also data on the
incidence of underemployment, average daily wage levels of workers, etc.
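The headline EUS magnitudes reduce to a pair of simple rates. A sketch with hypothetical
counts (NSSO reports typically express these per 1000 persons):

```python
# Hypothetical counts per 1000 persons; not actual NSSO estimates.
employed = 420.0
unemployed = 12.0
population = 1000.0

labour_force = employed + unemployed
worker_population_ratio = employed / population * 100  # WPR, per cent of population
unemployment_rate = unemployed / labour_force * 100    # UR, per cent of labour force
```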
EUS data throws light on aspects of quality and adequacy of employment, like
underemployment of the employed, the share of casual employment in employment,
male-female/regular/casual worker wage differentials and employment status of the
workers from poor households. ILS of LB provide data on wage/earnings in the
organised sector based on statutory returns and the distribution of workers in different
occupations by level of earnings in selected industries through Occupational Wage
Surveys (OWS). (Some of the sixth round reports have come out in 2008). ASI gives
data on wages/ emoluments in different industries in the factory sector. DE, NDE and
OAE and other Establishment Surveys provide data on average annual earnings for
men, women and children in the unorganised sector. Wage Rates in Rural India for
2005-2006 and the 2005 Report on the Working of the Minimum Wages Act, 1948
give rural/unorganised sector wage levels vis-à-vis statutory minimum wages. Data on
child labour and bonded labour from the Census and NSSO do not fully reflect the
ground level realities.
ILS also publishes data on several aspects of labour welfare like industrial injuries,
compensation to workers for injuries and death, industrial disputes and access to health
insurance and provident fund. These are incomplete, being based on statutory returns.
ILS also gives statistics relating to welfare funds set up in different industries. LB’s
reports on their ongoing programme of surveys throw light on the working and living
conditions of Scheduled Caste/Tribe workers, unorganised workers and contract labour.
5.9.3 Education
(a) Introduction
The Department of School Education and the Department of Higher Education of the
Ministry of Human Resource Development (MHRD), the National Council of
Educational Research and Training (NCERT) and the University Grants Commission
(UGC) collect and publish educational statistics and conduct research studies and surveys
in the area of education. Selected Educational Statistics (2004-05), Education in India
-Vols. I & IV (1998-99) and Annual Financial Statistics of Educational Sector (2005-
06) published annually by MHRD, the All India Educational Survey (Seventh AIES) –
(2002-03) of the NCERT, Annual Report of the DGE&T, the National Technical
Manpower Information System (NTMIS) and annual Manpower Profile of India of
the Institute of Applied Manpower Research (IAMR), the 52nd, 53rd, 55th and 61st
Round surveys of NSSO (1995-96, 1998-99, 1999-00 and 2004-05), the B and C series
tables of Census 2001 and the Planning Commission’s National Human Development
Report, (NHDR), 2001 and the State HDR reports are the major sources of data on
education.
(b) Educational Infrastructure: These volumes together give data on the number of
institutions established at various levels – schools, colleges, polytechnics, industrial
training institutes and facilities for apprenticeship training, their intake capacity, teaching
positions created and filled, training facilities for the physically disadvantaged, special
campaigns like Sarva Siksha Abhiyan and Total Literacy Campaign, adult education,
availability and adequacy of physical facilities in schools – type of buildings, number of
rooms for teaching vis-a-vis the number of pupils, access to drinking water and toilets
and urinals (and separately for girls), distance of the school from the pupil’s residence
and direct and indirect expenditure on education.
(c) Infrastructure Utilisation and Access to Educational Opportunities: These
volumes provide data on literacy rates – overall and for different groups, enrolment,
enrolment ratios and drop out rates at different levels of education and courses, teacher-
pupil ratios, output from various professional and non-professional courses, utilisation
patterns of professional manpower, stocks of different categories of manpower, the
educational profile of the population, incidence of disability in the population, output of
training facilities created for the vocational rehabilitation of the physically challenged
and their vocational rehabilitation.
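Two of the indicators these volumes report, the gross enrolment ratio and the drop-out
rate, follow simple formulae; the figures below are hypothetical:

```python
# Gross enrolment ratio (GER): enrolment at a level of education (all ages)
# relative to the population in the official age group for that level.
enrolled = 9_500_000
age_group_population = 10_000_000
gross_enrolment_ratio = enrolled / age_group_population * 100  # can exceed 100

# Drop-out rate for a cohort entering a stage of education.
cohort_entering = 100_000
cohort_completing = 62_000
drop_out_rate = (cohort_entering - cohort_completing) / cohort_entering * 100
```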
[Read . 1. Sub-Paper 4.2 of Paper on Agenda Item 4, 15th COCSSO, pp.18 – 35;
2. Section 9.5, Chapter 9, NSC Report, pp. 306 – 323.]
5.9.4 Health
(a) Introduction: One of the important dimensions of quality of life is health. A healthy
individual can contribute effectively to production of goods and services. Investment in
health is, therefore, an essential instrument of raising the quality of life of people and the
productivity of the labour force. What is the health status of the population? What are the
challenges to the health of the population and how are these being tackled? What kind of
data is available about these aspects of the population, the health infrastructure and the
efforts being made to deal with problems of health? What is the impact of these on the
health situation, especially of women and children? The annual publication National
Health Profile of India (NHPI) (2007 is the latest) of the Central Bureau of Health
Intelligence (CBHI) of the Ministry of Health & Family Welfare (MHFW), the
publications Sample Registration System (SRS): Statistical Reports, SRS
Compendium of India’s Fertility & Mortality Indicators, 1971-1997, Mortality
Statistics and Cause of Death and SRS Bulletin (half yearly) and the Social &
Cultural Tables (C Series Tables) of Census 2001 of the Registrar General India &
Census Commissioner (RGI&CC), the Planning Commission’s NHDR 2001 and the
HDRs of the State Governments and the Report on the National Family Health
Survey (NFHS-3), 2005-06, of the International Institute for Population Sciences
contain a large amount of information on these aspects of health.
(b) Health Infrastructure: NHPI provides data on the number of public and private
hospitals, dispensaries and the number of beds in these in rural/urban areas, similar data
on various health insurance schemes of the Government in different sectors/for different
sections of population, facilities for Indian Systems of Medicine, facilities for training
medical and health manpower, manning of medical and health positions in the health
system, stocks of medical and health manpower, programmes for controlling specific
communicable diseases and expenditure on health and family welfare.
(c) Public Health, Morbidity and Mortality Indicators: NHPI presents data on
programme for vaccination of children and pregnant women, incidence of communicable
and other diseases and mortality due to these, incidence of leprosy and tuberculosis,
National Aids Control Programme and other National Control/Eradication programmes,
levels of utilisation of different health insurance schemes, infant mortality, maternal
mortality, birth and death rates, fertility rates, incidence of disability and expectation of
life.
(d) National Family Health Survey (NFHS)-3: NFHS – 1, 2 & 3 conducted in 1992-93,
1998-99 and 2005-06 succeeded in building up an important demographic and health
database in India. These provide State-level estimates of demographic and health
parameters and also data on various socio-economic and programmatic factors that are
crucial for bringing about desired changes in India’s demographic and health situation.
NFHS – 3 covers all states. Some of the types of data provided (see
http://www.nfhsindia.org) are: age at first marriage of women, current fertility,
median age of women at the first and last birth of child, knowledge and practice of
contraception, estimates of age-specific death rates, crude death rates, infant/child
mortality rates, morbidity of selected diseases, immunization of children, vitamin A
supplementation of children, nutritional status of children, anaemia among them and
indicators of acute and chronic malnutrition among children – weight for age index,
height for age index and weight for height index, health status of women (Body Mass
Index and prevalence of anaemia), health problems of pregnancy and so on.
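The anthropometric measures named above follow simple formulae. A minimal sketch,
using a hypothetical reference median and standard deviation rather than the actual
WHO/NCHS growth-reference tables that NFHS work relies on:

```python
# Body Mass Index for adults: weight (kg) over height (m) squared.
def bmi(weight_kg, height_m):
    return weight_kg / height_m ** 2

# Weight-for-age, height-for-age and weight-for-height are reported as
# z-scores: how many reference standard deviations a child lies from the
# reference median (below -2 is conventionally taken as undernourished).
def z_score(observed, ref_median, ref_sd):
    return (observed - ref_median) / ref_sd

adult_bmi = bmi(50.0, 1.60)               # hypothetical woman: 50 kg, 1.60 m
weight_for_age = z_score(9.0, 12.0, 1.5)  # hypothetical child and reference values
```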
5.9.5 Environment
The process of development adversely affects the environment and, through it, the quality
of life of society. For instance, the excessive use of fertilizers and pesticides robs the soil
of its nutrients. Letting sewage, drainage and industrial effluents into rivers and water
bodies without prior treatment pollutes these, causing destruction of aquatic life
and endangering the health of people using such polluted water. The recent outcry in
Tirupur, near Coimbatore, a place well known for garment exports, against untreated
effluents from garment factories being let into the river used for drinking purposes is a
case in point. The exhaust fumes containing carbon monoxide (CO) and lead (Pb)
particles let into the air by vehicles using petrol or diesel are an example of air
pollution. The best example of industrial pollution through insufficient safety measures is
the Bhopal gas disaster where lethal gases leaking from a factory’s storage cylinder killed
many people immediately and maimed many others for life. The forest cover of the
country is continuously getting reduced due to indiscriminate felling of trees leading to
reduction in rainfall and changes in rainfall pattern, besides climatic changes. The
destruction of mangroves along seacoasts for housing/tourism development often leads to
soil erosion along the coast by the sea. The adverse effects of current models of
development on environment and the realisation of the need to take note of the cost to
development represented by such effects have now led to the development of
environmental economics as a new discipline in economics.
The Central and State Pollution Control Boards and the Ministry of Environment and
Forests (MOEF) evolve and monitor implementation of policies to protect the
environment. Statistics on environment are collected through this process by these
agencies and the CSO. The annual reports of the MOEF and the Compendium on
Environment Statistics, India 2007 published by the CSO from time to time are
excellent sources of data on environment. The latter especially is very comprehensive and
includes a very informative write up. The Compendium (and the annual report of MOEF)
can be accessed in the websites of the two organisations. Illustrative types of data on
environment available from these publications are Ambient Air Quality Status
[concentration of sulphur dioxide, nitrogen dioxide and Suspended Particulate Matter (SPM)
in air] in major cities of India, percentage of petrol-driven two-wheelers, three-wheelers
and four-wheelers meeting CO emission standards, and water quality of Yamuna river (in
the Delhi stretch) in respect of selected physico-chemical parameters during a year –
dissolved oxygen (mg/l), Biological Oxygen Demand (BOD) (mg/l), faecal
coliforms (number/100ml), total coliforms (number/100ml) and ammonical nitrogen
(mg/l). (SPM consists of oxides of silicon, calcium and other deleterious substances.
The most common contamination in water is from disease-bearing human wastes that are
usually detected by measuring faecal coliform levels.)
We have already looked at several of the factors determining the quality of life of the
people – education, health, employment and environment. Shelter and amenities are
another. Ironically, development projects also displace people from their normal way of
life. One other factor, an important one, is the level of income or consumption. The
relevant data are available from the quinquennial surveys of the NSSO on consumer
expenditure - those on levels of consumption for different MPCE classes. These and
others considered in the earlier sections lead to measures of the dimensions of poverty
and inequalities in income (consumption) and non-income aspects of life, HDIs, Gender
Development Indices (GDI) measuring gender discrimination, BMIs evaluating the health
status of women and the measures, Weight for Age Index, Height for Age Index and
Weight for Height Index gauging the nutritional status of children. All these measures are
also available from NSSO, the NHDR 2001 and the State HDRs for
judging the quality of life of the population and of the Scheduled Castes and Tribes. The
Social Statistics Division of CSO has a number of regular publications presenting data on
the elderly, gender differentials in different areas, progress towards millennium
development goals and a report on home-based workers (see MOSPI website). The
Sachar Committee Report on Minorities and the reports of the national and state level
Commissions for (i) the Scheduled Castes and Tribes, (ii) the Backward Classes,
(iii) the Minorities and (iv) Women review improvements in the quality of life
of these sections of society and their mainstreaming, empowerment and physical and
emotional safety and in so doing assemble an enormous amount of data from various
sources. Likewise, the Commissions for the Aged and for Children are sources of data on
these groups gathered at one place from various primary sources. The Annual Reports of
the Ministry of Social Justice and Empowerment, the nodal agency for the
welfare, development and empowerment of all these groups and of the physically and
mentally challenged, are another source of data on the status of these vulnerable groups in society.
[Read 1. Sections 9.6 (pp. 323 – 327) & 9.8 (pp. 331 – 336), Chapter 9, NSC Report. ]
The Indian Statistical System with the National Commission on Statistics at the apex
collects, compiles and disseminates an enormous amount of data on diverse aspects of the
Indian economy. Sections 5.4 to 5.10 have highlighted the characteristics of the database
on different aspects of the Indian economy, looking specifically at certain major ones.
The observations of NSC on these data areas and their suggestions can be seen in the
reading material cited under each Section/Sub-section. The entire NSC Report should in
fact serve as a guide to anyone wishing to know about data in any area and their quality.
The continuing endeavour is to build a data system that delivers quality data that is reliable and timely. The National
Commission on Statistics and the Indian statistical system working under its guidance are
already taking steps in this direction. As a researcher or analyst requiring statistical data
for your work it would be useful to be in touch with developments relating to the review,
refinement and expansion of the database in different aspects/sectors of the economy.
• The Journal of Income and Wealth of the Indian Association for Research in
National Income and Wealth is useful for those interested in methodological
developments in the field of National Accounts and examination of questions of
adequacy or suitability of available data for use in National Accounts work. For
instance, the Journal’s recent issues (Issue No. 24 – 1&2) carry a paper, A Case
Study on Estimation of Green GDP of Manufacturing Sector in India by S.K. Nath
& Samiram Mallick, that would be relevant in the context of the emerging
emphasis on environment-friendly industrial development. Another paper,
Services Sector in the Indian Growth Process: Myths and Reality by Sanjay
Kumar Hansda, would be useful in the current context of the perceived dominance
of the services sector in GDP.
• The discrepancy between estimates of PFCE made by CSO and those made from
NSS household surveys has been the subject of discussion for a long time. The
Report on the Cross Validation Study on Estimates of Private Consumption
Expenditure Available from Household Surveys and National Accounts prepared
by CSO and NSSO for the Study Group on Non - Sampling Errors is published in
Sarvekshana (Issue 88, Vol. XXV–XXVI, No. 41, pp. 1 – 69). Also see in this
connection Section 13.4 (pp. 492 to 506) and in particular sub-sections 13.4.7
(about the above Study)(pp. 503 – 506), NSC Report.
• A clear understanding of the concepts used in the surveys of NSSO would be very
useful for an analyst/researcher. Explanations of technical terms, their definitions
and the underlying concepts in NSS socioeconomic surveys up to the 55th Round
(excluding the terms used in ASI, price collection work and crop surveys) are
given in Concepts and Definitions Used in NSS (May, 2001). Modifications made
in definitions, etc., in recent Rounds (60th onwards) are available Round-wise. See
NSSO/SDRD section of the MOSPI website.
• The Report on the National Seminar on NSS 61st Round Results (October 2007)
and the papers presented there are available on the MOSPI website.
5.12 REFERENCES
18. From where and how can you get data on the per-capita availability of milk, eggs
and wool, and on the population of cows and buffaloes? What do you think of
their timeliness and reliability?
19. What kind of data on inequalities is available from the Agricultural Census?
Comment on the uses to which such data can be put.
20. What do you mean by the term 'cropping intensity'?
21. What data is available on subsidy given to agriculture? Comment on its accuracy.
Attempt to work out measures like those relevant to subsidy worked out in OECD
countries and presented in ASG 2004. (Producer Support Estimate – PSE and
%PSE).
22. What are the major efforts made to collect data on different aspects of the agricultural sector?
23. Discuss the characteristics of data flowing from the agencies involved in
compilation of agricultural data.
24. Indicate the major sources of data on levels of industrial employment. Comment
on their scope, coverage, reliability and timeliness.
25. Discuss the role of the Economic Census in the industrial database.
26. Discuss the kinds of data that ASI provides to an understanding of different
aspects of the factory sector. What are its contributions to economic analysis?
27. Discuss the adequacy, quality and the representative character of the Index of
Industrial Production (IIP).
28. Discuss the limitations of the time series data on small scale industries available
from DCSSI.
29. Enumerate the data sources on flow of credit to different sub sectors of industry.
30. "The detailed data available in the industrial sphere relates to the factory sector."
Explain.
31. Do you think that data available on the unorganised sector is inadequate? What
suggestions would you like to make in this regard?
32. What are the two major sources of data on merchandise trade? What kind of
trade data is compiled by each of the two sources?
33. How are the measures Gross Terms of Trade, Net Terms of Trade and Income
Terms of Trade obtained?
34. What are the reasons for divergence between the two sources of trade data?
35. Explain the terms ‘net invisibles’ and ‘non-factor services’. In how much detail is
data on trade in services available, and where?
36. Indicate the documents that provide different kinds of data on public finances.
37. Indicate the various measures of deficit and the relationship between them.
38. Identify the transactions, other than trade in merchandise, that India has with the
rest of the world.
39. Which document contains the methodology for compiling liquidity aggregates?
40. List the kind of time series data available in RBI Handbook 2008.
41. Identify the major sources of data on financial markets.
42. Name the sources that provide data on employment and unemployment.
43. Discuss the scope, reliability and utility of EMIP and live register data as a source
of data on levels of employment and levels of unemployment respectively.
44. Explain the kind of data on employment and unemployment available from the
population census 2001.
Structure
6.0 Objectives
6.1 Introduction
6.2 An overview of the block
6.3 SPSS Package
6.3.1 Features of SPSS for Windows
6.3.2 Getting acquaintance with SPSS
6.3.3 Menu commands and Sub-commands
6.3.4 Basic steps in Data analysis
6.3.5 Defining, Editing and Entering Data
6.3.6 Data file management functions
6.3.7 Running a Preliminary Analysis
6.3.8 Understanding Relationship between Variable: Data Analysis
6.3.9 SPSS Production Facility
6.4 Statistical Analysis System (SAS)
6.5 NUDIST
6.6 EVIEWS Package
6.6.1 EVIEWS Files and Data
6.6.1.1 Creating a Work file
6.6.1.2 Importing Time Series Data from Excel
6.6.1.3 Transforming the Data
6.6.1.4 Copying Output
6.6.1.5 Examining the Data
6.6.1.6 Displaying Correlation and Covariance Matrices
6.6.1.7 Seasonality of the series
6.6.1.8 Estimating Equations
6.6.1.9 Testing for Unit Roots
6.6.1.10 ARIMA/ARMA identification and estimation
6.6.1.11 Granger Causality Test
6.6.2 Vector Auto Regression (VAR)
6.0 OBJECTIVES
6.1 INTRODUCTION
SPSS (Statistical Package for Social Sciences) is one of the packages often preferred by
researchers and analysts for data management and detailed statistical analysis. It
provides techniques for data processing and procedures for logistic regression, log-linear
analysis, multivariate analysis and analysis of variance. It also equips us with
procedures for constrained non-linear regression, probit, Cox and actuarial survival analysis.
Further, it produces high-quality presentations of data. SPSS also performs
comprehensive forecasting and time series analysis with multiple curve-fitting models,
smoothing models and methods for estimation of autoregressive functions. In the first
part of this block, efforts have been made to introduce the various operational procedures
relating to data processing, statistical analysis and presentation of data.
Once the data has been collected, the first step is to look at it in a variety of ways. While
there are many specialized software application packages for different types of data
analysis (relating to scientific, commercial and financial problems), a researcher is often
faced with a situation where the general treatment and standard statistical analysis of the
quantitative data is required. SPSS (Statistical Package for Social Sciences) is one such
package that is often used by researchers and analysts for data management and exploring
it before attempting a detailed statistical analysis. It is a preferred choice for research
analysis due to its easy-to-use interface and comprehensive range of data manipulation
and analytical tools.
Suppose, you are interested in knowing the attitude of students towards distance
education and for that you have administered a data collection instrument (commonly
known as questionnaire) to some students. Now you want to process and analyze the data.
Till recently, data were processed manually and it was indeed a cumbersome process.
Fortunately, now we live in an age when high-speed computers can do the job of
processing and analysis of data in a very short period of time and of course without
errors. What you have to do is to learn some fundamental concepts used in this
programme. Now you can sit at the computer and process and analyze the data that you
have collected by administering a questionnaire. In fact, you will find it helpful and
interesting to keep the SPSS Application guide nearby while you process and analyze
your data.
To help you work with SPSS, some of its general features are highlighted next.
6.3.1 Features of SPSS for Windows
SPSS is one of the leading desktop statistical packages. It is an ideal companion to
database and spreadsheet packages, combining many of their features as well as adding its own
specialized functions. SPSS for Windows is available as a base module, with a number of
optional add-on enhancements. Some versions present SPSS as an
integrated package including the base and some important add-on modules.
SPSS Professional Statistics provides techniques to examine similarities and
dissimilarities in data, to classify data and to identify underlying dimensions in a data set. It
includes procedures for cluster, k-means cluster, discriminant, factor, multidimensional scaling,
proximity and reliability analysis.
SPSS Advanced Statistics includes procedures for logistic regression, log-linear analysis,
multivariate analysis and analysis of variance. This module also includes procedures for
constrained non-linear regression, probit, Cox and actuarial survival analysis.
SPSS Tables creates presentation-quality tabular reports, including stub-and-banner
tables and displays of multiple-response data sets. The new features include
pivot tables, a valuable tool for the presentation of selected analytical output tables.
SPSS Trends performs comprehensive forecasting and time series analysis with multiple
curve-fitting models, smoothing models and methods for estimation of autoregressive
functions.
SPSS Categories performs conjoint analysis and optimal scaling procedures, including
correspondence analysis.
SPSS also provides simplified tabular analysis of categorical data, develops predictive
models, screens out extraneous predictor variables, and produces easy-to-read tree
diagrams that segment a population into sub-groups that share similar characteristics.
Recently, the SPSS Corporation announced the release of SPSS version 15.0. Many new
add-on products have also been launched in recent months. You can consult the SPSS
World Wide Web site for the latest developments and additions to the computing power
of SPSS. Technical support is also available to the registered users at the SPSS site. The
SPSS Web site is http://www.spss.com. Select white papers on SPSS applications in
major disciplines are also available on this site.
The present unit discusses some of the commonly used data management techniques and
statistical procedures using SPSS 11.5 version. Since new features are added almost
daily, you are advised to check these details on the version of SPSS currently installed
on your computer and also to consult the user manuals before undertaking complex types of
data analysis. Online help is also available. There may be some procedures and
syntax-related changes from one version to another. In case these are not available on
your version of SPSS, please consult the relevant SPSS authorized representative or the
WWW site of the SPSS corporation. The most recent version of SPSS is now called
PASW.
With this basic knowledge let us get acquainted with the SPSS.
6.3.2 Getting Acquaintance with SPSS
SPSS for Windows runs on Windows 3.x, Windows 95/98 and later operating systems;
UNIX, Mac and mainframe versions of the SPSS software are also available. The
illustrations in this unit are based on the SPSS version for the Windows 95/98/NT
operating systems. We are assuming that
SPSS is installed on your machine.
Starting SPSS
SPSS for Windows uses a graphical environment, descriptive menus and simple dialog
boxes to do most of the work. It produces three types of files, namely data files, chart files
and text files.
To start SPSS, click the Start button on your computer. On the Start menu that appears,
click Programs. Another menu appears to the right of the Start menu. If there is an entry
marked SPSS, that’s the one you want to click. If there isn’t, click the program group
where SPSS was installed and an entry marked SPSS will appear. Click the SPSS 11.5
(or whichever version) entry. You will know SPSS has started when an SPSS
Data Editor window appears. To begin with, the SPSS Data Editor window will be empty
and a number of menus will appear at the top of the window. We can start the operations
by loading a data set or by creating a new file for which data is to be entered from the
Data Editor window. Data can also be imported from other programs like dBase,
ASCII, Excel and Lotus; we will learn about this in a little while from now.
Exiting SPSS
Make sure that all SPSS and other files are saved before quitting the program. Exit the
software by selecting the Exit command from the File menu of the SPSS Data Editor
window. In the case of unsaved files, SPSS will prompt you to save or discard the changes
in each file.
Saving data and other files
Many types of files can be saved using the ‘Save’ or ‘Save As’ command. The various
types of files used in SPSS are: Data, Syntax, Chart and Output. Files from spreadsheets or other
databases can also be imported by following the appropriate procedure. Similarly, an
SPSS file can be saved as a spreadsheet or in dBase format. Select the appropriate save
type and save the file. SPSS data files are saved with the extension .sav.
Though SPSS files can be given any name, the use of reserved words and
symbols is to be avoided in all types of file names.
Printing of data and output files
The contents of SPSS data files, Output Navigator files and Syntax files can be printed
using the standard ‘Print’ command. SPSS uses the default printer for printing. In
the case of network printers, an appropriate printer should be selected for printing the
output. It is suggested that ink-jet or laser printers be used for printing graphs
and charts. Tabular data can easily be printed using a dot-matrix printer.
Operating Windows in SPSS
There are seven types of windows in SPSS which are frequently referred to during the
data management and analysis stages. These are:
Data Editor
As mentioned earlier, the data editor window opens automatically as soon as SPSS is loaded.
To begin with, the data editor does not contain any data. The file containing the data for analysis
has to be loaded with the help of ‘file’ menu sub-commands by using various options available
for this purpose. The contents of the active data file are displayed in the data editor window. Only
one data editor window will be active at a time. No statistical operations can be performed until
some data is loaded into the data editor.
Output Navigator
All SPSS messages, statistical results, tables and charts are displayed in the output navigator. The
output navigator can be opened/closed using the File Open/New Command. The output in the
navigator window can be edited and saved for future reference. The Output Navigator opens
automatically, the first time some output is generated. The user can customize the presentation of
reports and tables displayed in the Output Navigator. The output can be directly imported into
reports prepared under word-processing packages, and the output files are saved with the
extension .spo.
Pivot Tables
The output shown in the Output Navigator can be modified in many ways using the Edit and
Pivot Table options, which can be used to edit text, swap rows and columns, add colour, prepare
custom-made reports/output, and create and selectively display multi-dimensional tables. Results
can be selectively hidden or shown using the features available in Pivot Tables.
Graphics
The Chart Editor helps in switching between various types of charts, swapping the X and Y
axes, and changing colours, and provides facilities for presenting data and results through various
types of graphical presentation. It is useful for customizing charts to highlight specific features of
the charts and maps.
Text Editor
Text output not displayed in the Pivot Tables can be modified with the help of the Text Editor,
which works like an ordinary text editor. The output can be saved for future reference or for
sharing.
Syntax Editor
The Syntax Editor can be opened and closed like any other file using the File Open/New
command. The use of a Syntax file is recommended when the same type of analysis is to be
performed at frequent intervals or on a large number of data files. Using a Syntax file for
such purposes automates complex analyses and also avoids errors from repeatedly typing the
same commands. Commands can be pasted into the Syntax file using the Paste button
available in the dialog boxes. Experienced users can directly type commands in the
Syntax window. To run the syntax, select the commands to be executed and click the Run
button at the top of the Syntax window. All or some selected commands from the Syntax file will
be executed. The Syntax file is saved with the extension .sps.
Script Editor
This facility is normally used by advanced users. It offers a fully featured programming
environment that uses the Sax BASIC language and includes a Script Editor, Object Browser,
debugging features and context-sensitive help. Scripting allows you to automate tasks in SPSS,
including:
• Automatically customizing output
• Opening and saving data files
• Displaying and manipulating SPSS dialog boxes
• Running data transformations and statistical procedures using SPSS command syntax
• Exporting charts as graphic files in a variety of formats
The present module will not go into the details of the advanced features of SPSS, including
scripting.
Figure 6.1: SPSS data editor shows the data editor menus. Each command in the main menu has a number of
sub-commands.
Table 6.1 : Components of data editor menu
Menu Function/sub-commands
File Open and save data files; import data created in other formats like Lotus,
Excel, dBase, etc.; print-control functions like page setup, printer setup and
associated functions. ASCII data can also be read into SPSS.
Edit These functions are similar to those available in general packages. They
include undo, redo, cut, copy, paste, paste variable, find, and find and replace.
Option settings for SPSS are controlled through the Edit menu.
View Customize toolbars, fonts, grids and the display of data; displays the option for
showing value labels.
Data This is a very important menu as far as management of the data is concerned.
Variable definition, inserting new variables, transposing, applying templates,
aggregating and merging of data files, and splitting data files for specific
analyses are some important commands in the Data menu.
Transform Compute new variables, recode, random number generation, ranking, time
series data transformations, count and missing-value analysis are undertaken
using the Transform menu.
Analyze As the name implies, the Analyze menu incorporates the statistical procedures:
frequency distributions, cross-tabulations, comparison of means, correlation,
simple and multiple regression, ANOVA, log-linear regression, discriminant
analysis, factor analysis, non-parametric tests and time series analysis are all
undertaken using the Analyze menu.
Graphs Includes options for generating various types of custom-made graphics like bar,
pie, area, X-Y and high-low charts, Pareto and control charts, box-plots,
histograms, P-P and Q-Q charts and time series representations of data.
Utilities Information about variables, information on the working data file, running
scripts and defining sets are some of the important functions carried out through
the Utilities menu.
Window The Window menu is used to switch between SPSS windows.
Help Context-specific help through dialog boxes, a demo of the software and
information about the software are some of the important options under the Help
menu. It provides a link to the SPSS home page. The statistical
coach included in the help module is very useful in understanding the various
stages of executing a procedure.
Setting The Options
SPSS provides a facility for setting user-defined options. Use the Edit menu and
then select Options. The types of optional settings allowed in SPSS are illustrated in
Figure 6.2. Make the appropriate changes to set the options according to your choice.
With this basic knowledge about commands and sub-command now let us learn about the basic
steps in data analysis.
Step 1: Define your data.
Step 2: Enter your data into the SPSS Data Editor.
Step 3: Select the variables for the analysis.
Step 4: Select a procedure from the menus to calculate statistics.
Step 5: Run the procedure and look at the results.
(Tail of a variable-definition table: marital-status value labels 2 = Widowed, 3 = Divorced,
4 = Separated, 5 = Never married, 9 = Missing; variable DOB, Date of birth, no labels,
format Date dd mm yy.)
To save data
• From the menus choose:
File
Save
(Click on save)
Because these data have not been saved previously, you will see a dialogue box prompting you to
enter a file name. Type in the name "attitude" and click the OK button. SPSS will then save the
data to this file. (SPSS will automatically attach the ‘.sav’ extension if you do not type it in.)
Aggregate Data
The Aggregate Data command combines groups of cases into a single summary case and creates a
new aggregated data file (refer to Figure 6.10). Cases are aggregated based on the values of one or
more grouping variables, and the new (aggregated) file contains one record for each group. The
aggregate file can be saved with a specific name provided by you; otherwise, the
default name is aggregate. For example, data on learners’ achievement could be
aggregated by sex, state and region.
A number of aggregate functions are available in SPSS. These include sum, mean, number of
cases, maximum value, minimum value, standard deviation, and the first and last values. Other
summary functions include percentages and fractions below and above a particular user-defined
cut-off value.
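To see concretely what aggregation produces, here is a small sketch in Python (the records, variable names and summary functions are hypothetical illustrations of the idea; SPSS itself requires no programming for this):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical learner-achievement records: (sex, state, score).
records = [
    ("F", "Delhi", 72), ("M", "Delhi", 65),
    ("F", "Delhi", 80), ("M", "Kerala", 70),
    ("F", "Kerala", 90),
]

# Group cases by the grouping variables (sex, state), then build one
# summary case per group, as the Aggregate Data command does.
groups = defaultdict(list)
for sex, state, score in records:
    groups[(sex, state)].append(score)

aggregated = {
    key: {"n": len(scores), "mean": mean(scores), "max": max(scores)}
    for key, scores in groups.items()
}

for key, summary in sorted(aggregated.items()):
    print(key, summary)
```

Each key of `aggregated` corresponds to one record in the new aggregated file.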
Split File
Researchers are often interested in comparing summary and other statistics across groups. For
example, in a study of learning achievement, the researcher may be interested in comparing the
mean scores of students of different sexes, with sex taken as the grouping variable. Multiple
grouping variables can also be selected; a maximum of eight grouping variables can be defined.
Cases need to be sorted by the grouping variables. Two options are available for comparative
analysis: compare groups, and organize output by groups. Split File is available under the Data
menu for making such comparisons. Refer to Figure 6.10 above.
Select Cases
The Select Cases command can be used for selecting a random sub-sample or a sub-group of
cases based on specified criteria involving variables and complex expressions. The following
criteria can be used with the Select Cases command:
• Select if (a condition is satisfied): variable values and their ranges
• Date and time ranges
• Arithmetic expressions, logical expressions and functions
• Row numbers
Following the Select Cases command, the unselected cases can either be deleted or temporarily
filtered. Deleted cases are removed from the active file and cannot be recovered, so you should be
careful while selecting the Delete option. Filtered cases are excluded only temporarily and can be
restored. When the Select Cases option is on, this is indicated in the Data Editor window.
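The logic of selecting cases on a condition can be illustrated with a short Python sketch (the cases and the condition are hypothetical):

```python
# Hypothetical cases: (case_id, age, score).
cases = [(1, 17, 55), (2, 21, 68), (3, 19, 72), (4, 25, 40)]

# "Select if (condition is satisfied)": keep cases with age >= 18
# and score > 50. Unselected cases are set aside rather than
# deleted, mimicking the recoverable Filter option.
def condition(case):
    _, age, score = case
    return age >= 18 and score > 50

selected = [c for c in cases if condition(c)]
filtered_out = [c for c in cases if not condition(c)]

print(selected)
print(filtered_out)
```

Keeping `filtered_out` around, instead of discarding it, is the analogue of choosing Filter rather than Delete.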
Next, let us review the aspects linked with running a preliminary analysis.
Data Transformation
Data transformation is a very useful aspect of SPSS. Using data transformation, you can collapse
categories, recode the data and create new variables based on complex equations and conditional
statements. Some of the functions are detailed below:
Compute variable:
• Compute values for numeric or string variables
• Create new variables or replace the value of existing variables. For the new variables, you
can specify the variable type and label.
• Compute values selectively for sub-sets of data based on logical conditions.
• Use built-in functions, statistical functions, distribution functions and string functions.
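The idea behind computing a new variable, including computing it only for a sub-set of cases defined by a logical condition, can be sketched as follows (the variables and the cut-off are hypothetical):

```python
import math

# Hypothetical cases with income and household size.
cases = [
    {"income": 12000, "size": 4},
    {"income": 30000, "size": 5},
    {"income": 8000, "size": 2},
]

# Compute a new variable for every case, and a second one only for
# the sub-set satisfying a logical condition (an "If" clause).
for case in cases:
    case["pc_income"] = case["income"] / case["size"]   # per-capita income
    if case["pc_income"] > 3500:                        # hypothetical cut-off
        case["log_income"] = math.log(case["income"])   # built-in function

print(cases)
```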
Recode variables
Recoding of variables is an important feature of data management in SPSS. Many
continuous and discrete variables need to be recoded for meaningful analysis. Recoding can be
done either within the same variable or into a new variable. Recoding within the same
variable replaces the original values with the new values; recoding into a new variable
leaves the original variable intact. The following example illustrates the need for and use of
recoding variables.
A survey of the primary schools was conducted in Delhi. Along with other variables, information
on the type of management was also collected. The management code was designed as follows:
1) Government
2) Local bodies
3) Private aided
4) Private unaided
5) Others
Let us assume that a comparative analysis of government and private management schools
is to be undertaken. This will be done by combining categories 1 and 2 (government and local
bodies) and categories 3 and 4 (private aided and unaided). This can be achieved by recoding the
management code into a new variable as 1 (for categories 1 and 2) and 2 (for categories 3 and 4).
Assuming that a database on primary schools in Delhi is available, the enrolment analysis could
be attempted by making suitable categories, i.e. schools with less than 50 students, 51 - 150, 151 -
250 and more than 250 students. This could be achieved by recoding the enrolment variable into a
new variable ‘category’. The analysis could be attempted by changing the class range for
category. If at a later stage in the analysis, it is found that a new category is to be introduced, it
can again be achieved by recoding the enrolment data.
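The two recodings described above can be sketched in Python (the school records and variable names are hypothetical illustrations):

```python
# Hypothetical school records: management code (1-5) and enrolment.
schools = [
    {"mgmt": 1, "enrol": 35}, {"mgmt": 3, "enrol": 120},
    {"mgmt": 2, "enrol": 210}, {"mgmt": 4, "enrol": 400},
]

def enrol_category(n):
    # Size categories: up to 50, 51-150, 151-250, above 250.
    if n <= 50:
        return 1
    if n <= 150:
        return 2
    if n <= 250:
        return 3
    return 4

# Recode into NEW variables so the original values are preserved:
# management codes 1-2 -> 1 (government), 3-4 -> 2 (private).
for s in schools:
    s["mgmt_new"] = 1 if s["mgmt"] in (1, 2) else 2
    s["category"] = enrol_category(s["enrol"])

print([(s["mgmt_new"], s["category"]) for s in schools])
```

Changing the class ranges later only requires editing `enrol_category` and recoding again, which is exactly the flexibility the text describes.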
Count
Count is an important command available in SPSS and is used for counting occurrences of the
same value(s) in a list of variables within the same case. For example, a survey might contain
a list of books purchased (yes/no) by students. You could count the number of ‘yes’
responses, or generate a new variable giving the count, indicating the
number of books bought.
Procedure to run the Count command:
Choose Transform from the main menu.
Choose Count.
Enter the name of a target variable (the variable where the count value will be stored).
Select two or more variables of the same type (numeric or string).
Click Define Values and specify which value(s) are to be counted.
Click OK after the selections have been made.
In a survey on learners’ achievement, the answer code to each question in language and
mathematics could be recorded for each student. The codes could be ‘1’ for the correct answer, ‘2’
for the wrong answer and ‘3’ for no reply. The Count command can then be used to count the number
of correct answers.
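The counting of correct answers described above amounts to the following (the answer codes and variable names are hypothetical):

```python
# Hypothetical answer codes: 1 = correct, 2 = wrong, 3 = no reply.
students = [
    {"q1": 1, "q2": 2, "q3": 1, "q4": 3},
    {"q1": 1, "q2": 1, "q3": 1, "q4": 2},
]
questions = ["q1", "q2", "q3", "q4"]

# Target variable n_correct: occurrences of the value 1 across the
# selected variables within each case, as the Count command does.
for s in students:
    s["n_correct"] = sum(1 for q in questions if s[q] == 1)

print([s["n_correct"] for s in students])
```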
Rank Cases
The Rank Cases command can be used to rank observations in ascending or descending order.
Other options available for ranking cases are shown in the right-hand panel of Figure 6.11.
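The logic of ranking, including the common option of assigning tied observations the mean of the ranks they jointly occupy, can be sketched as follows (the scores are hypothetical):

```python
# Hypothetical scores to be ranked in descending order; tied cases
# receive the mean of the ranks they jointly occupy.
scores = [88, 75, 92, 75, 60]

order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
ranks = [0.0] * len(scores)
i = 0
while i < len(order):
    j = i
    # Extend j over any run of tied scores.
    while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
        j += 1
    mean_rank = (i + j) / 2 + 1          # ranks are 1-based
    for k in range(i, j + 1):
        ranks[order[k]] = mean_rank
    i = j + 1

print(ranks)
```

Here the two scores of 75 share rank 3.5, the mean of ranks 3 and 4.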
After the appropriate selections have been made, the output is displayed in the Output Navigator
window. A chart can be modified by double-clicking on any part of it. Some typical
modifications include the following:
• Edit axis titles and labels and footnotes
• Change scale (X - Y)
• Edit the legend
• Add or modify a title
• Add annotation
• Add an outer frame
Another important category of charts is the High-Low chart, which is often used to represent
variables like the maximum and minimum temperature in a day, stock market behaviour or
similar variables. Box-plot and Error Bar charts help you visualize distribution and dispersion. A
box plot displays the median and quartiles, and special symbols are used to identify outliers, if
any. An Error Bar chart displays the mean and confidence intervals or standard errors. To obtain
a box plot, choose Boxplot from the Graphs menu. The simple box plot for mean scores obtained
in English and Hindi is shown in Figure 6.14.
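The quantities a box plot displays can also be computed directly. The sketch below, with hypothetical scores, obtains the quartiles and flags outliers using the conventional 1.5 × IQR fences (one common convention for the special outlier symbols):

```python
from statistics import quantiles

# Hypothetical mean scores.
scores = [34, 45, 48, 52, 55, 58, 61, 64, 70, 95]

q1, q2, q3 = quantiles(scores, n=4)        # quartiles; q2 is the median
iqr = q3 - q1                              # inter-quartile range (box height)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in scores if x < lower_fence or x > upper_fence]

print((q1, q2, q3), outliers)
```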
Select one or more variables. To do this, click the variable "V1" to select it for analysis. Then,
optionally, you can:
• Click Statistics for descriptive statistics for quantitative variables.
• Click Charts for bar charts, pie charts and histograms.
If you click Statistics you will get a dialog box as shown in Figure 6.17.
B. Cross-tabulations
Cross-tabulation is the simplest procedure for describing a relationship between two or more
categorical variables.
Suppose you are interested in knowing whether there is an association between two categorical
variables such as "gender" and "attitude towards infant feeding"; you have to cross-tabulate the
two variables and apply some statistical tests.
To cross-tabulate:
From the menus (refer to Figure 6.15) choose:
Analyze
Descriptive Statistics
Crosstabs...
You will get a dialog box as shown in Figure 6.19.
Select "Gender" for the row variable and "Attitude" for the column variable.
You can select one or more row variables and one or more column variables. Optionally, you can
click the Statistics button; in the dialog box that appears, click the box next to the statistical tests
you wish to apply. For example, if you wish to apply the chi-square test, click the box next to
Chi-square, then click the Continue button. Next, click the Cells button. You will get the
Crosstabs: Cell Display dialog box as shown in Figure 6.21. Click the row or column (or both)
percentage box, click the Continue button, and then click the OK button. The table shown in
Figure 6.21 will appear in the Output Navigator window.
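What Crosstabs and the chi-square test compute can be illustrated by hand, with hypothetical gender × attitude data (the category labels and counts are invented for illustration):

```python
from collections import Counter

# Hypothetical cases: (gender, attitude).
cases = ([("M", "Favour")] * 30 + [("M", "Against")] * 20
         + [("F", "Favour")] * 10 + [("F", "Against")] * 40)

observed = Counter(cases)                  # the cross-tabulation cells
genders = sorted({g for g, _ in cases})
attitudes = sorted({a for _, a in cases})
n = len(cases)
row_total = {g: sum(observed[(g, a)] for a in attitudes) for g in genders}
col_total = {a: sum(observed[(g, a)] for g in genders) for a in attitudes}

# Pearson chi-square: sum of (O - E)^2 / E over all cells, where the
# expected count under independence is E = row total * column total / n.
chi_sq = 0.0
for g in genders:
    for a in attitudes:
        expected = row_total[g] * col_total[a] / n
        chi_sq += (observed[(g, a)] - expected) ** 2 / expected

print(round(chi_sq, 2))
```

A large chi-square relative to its critical value suggests the two variables are associated rather than independent.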
Compare Means
One-Way ANOVA
Suppose you are interested in testing the hypothesis: "Do students in each of the three groups of
religious affiliation have similar mean attitude scores?"
From the menu (refer to Figure 6.25) choose
Analyze → Compare Means → One-way ANOVA
You will see the dialog box shown in Figure 6.25.
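The F statistic that One-way ANOVA reports is the ratio of the between-group to the within-group mean squares; a sketch with hypothetical attitude scores for three groups:

```python
from statistics import mean

# Hypothetical attitude scores for three religious-affiliation groups.
groups = [
    [62, 68, 70, 64],
    [55, 59, 61, 57],
    [72, 75, 71, 74],
]

all_scores = [x for g in groups for x in g]
grand_mean = mean(all_scores)
k, n = len(groups), len(all_scores)

# Between-group and within-group sums of squares, and the F ratio
# of their mean squares (df = k - 1 and n - k respectively).
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))

print(round(f_stat, 2))
```

A large F indicates that the group means differ by more than within-group variability alone would explain.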
For a discussion of these analysis options, see Schroeder, Sjoquist and Stephan (1986), as well as
the chapter on regression in the SPSS manuals. Click the OK button to run the procedure. The
results of the regression analysis will appear in the Output Navigator window.
Linear regression is the most commonly used procedure for analysing a cause-and-effect
relationship between one dependent variable and a number of independent variables. The
dependent and independent variables should be quantitative. Categorical variables like sex and
religion should be recoded into dummy (binary) variables or other types of contrast variables. An
important assumption of regression analysis is that the distribution of the dependent variable
is normal. Moreover, the relationship between the dependent variable and all the independent
variables should be linear, and all observations should be independent of each other.
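To see what the procedure computes, here is a minimal sketch of least-squares regression with a single independent variable, written in pure Python with made-up data (SPSS solves the general multi-variable case, but the logic is the same):

```python
# Simple ordinary least squares: fit y = intercept + slope * x by
# minimising the sum of squared residuals, and report R-square.
def simple_ols(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    # R-square = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

# Illustrative data lying exactly on the line y = 1 + 2x
print(simple_ols([1, 2, 3, 4, 5], [3, 5, 7, 9, 11]))  # (2.0, 1.0, 1.0)
```

The slope, intercept and R-square correspond to the regression coefficient, the constant and the goodness-of-fit statistic reported in the SPSS regression output.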
SPSS provides extensive scope for regression analysis using various types of selection processes.
The method of selecting independent variables for linear regression analysis is an important
choice which the researcher should consider before running the analysis. You can construct a
variety of regression models from the same set of variables by using different methods.
You can enter all the variables in a single step or enter the independent variables selectively.
Variable selection method is shown in Figure 6.27.
Regression coefficients: The Estimates option displays the regression coefficient β, its standard
error, the standardized coefficient beta, the t-value, and the two-tailed significance level of t.
Covariance matrix displays a variance-covariance matrix of the regression coefficients, with
covariances off the diagonal and variances on the diagonal. A correlation matrix will also be displayed.
Model fit: The variables entered into and removed from the model are displayed. Goodness-of-fit
statistics (R-square, multiple R, adjusted R-square, and the standard error of the estimate) and an
analysis of variance table are displayed.
If other options are ticked, the statistics corresponding to each of those options are also displayed
in the Output Navigator. If the data do not show a linear relationship and a transformation
does not help, try the Curve Estimation procedure.
Non-Parametric Tests
The non-parametric test procedure provides several tests that do not require assumptions about
the shape of the underlying distribution. These include the following most commonly used tests:
• Chi-square test
• Binomial test
• Runs test
• One-sample Kolmogorov-Smirnov test
• Two independent Sample tests
• Tests for several independent samples
• Two related sample tests
• Tests for several related samples.
Here, we shall discuss the procedure for the chi-square test only. You are advised to consult the
SPSS users’ manual and other statistical books for a detailed discussion of the other tests.
Chi-Square
The chi-square test (refer to Figure 6.29) is the most commonly used test in social science research.
The goodness-of-fit test compares the observed and expected frequencies in each cell/category
to test either that all categories contain the same proportion of values or that each category
contains a user-specified proportion of values.
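The goodness-of-fit statistic itself is simple enough to sketch by hand. The following pure-Python illustration (not part of SPSS; the counts are made up) covers both cases described above: equal expected proportions by default, or user-specified proportions.

```python
# Chi-square goodness-of-fit test: compare observed category counts
# with expected counts under a hypothesised set of proportions.
def chi_square_gof(observed, expected_props=None):
    """Return (chi-square statistic, degrees of freedom)."""
    n = sum(observed)
    k = len(observed)
    if expected_props is None:
        expected = [n / k] * k          # all categories equally likely
    else:
        expected = [n * p for p in expected_props]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2, k - 1

# Hypothetical counts across three categories, tested against equality
print(chi_square_gof([10, 20, 30]))  # (10.0, 2)
```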
• PROC ANOVA: Analysis of variance for all types of designs (one-way, two-way and
others).
• PROC FREQ: Frequency distribution for one or more variables.
As pointed out by Klieger (1984), the SAS package is comparatively more difficult to use due to
its procedural complexities. For greater detail on the SAS package, you are advised to consult the
books by Klieger and Sprinthall.
6.5 NUDIST
Computer programmes help in the analysis of qualitative data, especially in understanding a large
(say, 500 or more pages) text database. In studies using large databases, such as ethnographies
with extensive interviews, computer programmes provide an invaluable aid to research.
The NUDIST (Non-numerical Unstructured Data Indexing, Searching and Theorizing)
programme was developed in Australia in 1991. This package is used for the qualitative analysis
of data. Here we present briefly the main features of this package. The software requires 4
megabytes of RAM and at least 2 megabytes of space for data files on your PC or Mac. On a PC
it operates under Windows (Creswell 1998).
As a researcher, this software will help you with the following:
1) Storing and organizing files: First, establish document files and store the information with the
NUDIST programme. Document files may consist of transcripts from interviews, notes of
observations, or articles scanned from a newspaper.
2) Searching for themes: Tag segments of text from all the documents that relate to a single idea
or theme. For example, in a study on the effectiveness of distance education, distance learners
talk about the role of academic counselors. The researcher can create a node in NUDIST called
‘Role of Academic Counselors’, select the text in the transcripts where learners have talked
about this role, and merge it into that node. The information retained in the node can then be
printed out to show the different ways in which learners talk about the role of academic
counselors.
3) Crossing themes: Taking the same example of the role of counselors, the researcher can relate
this node to other nodes. Suppose the other node is the qualifications of counselors, with two
categories: Graduate and Post Graduate. The researcher can ask NUDIST to cross the two
nodes, role of counselors and qualifications of counselors, to see, for example, whether
graduate counselors view their role differently from post graduate counselors. NUDIST
generates a matrix, with the information in the cells reflecting the different perspectives.
4) Diagramming: In this package, once the information is categorized, the categories are
identified and developed into a visual picture that displays their interconnectedness. This is
called a tree diagram in NUDIST. A tree diagram is a hierarchical tree of categories, with the
root node at the top and parent and sibling nodes below it. The tree diagram is a useful device
for discussing the data analysis of qualitative research at conferences.
5) Creating a template: In qualitative research, at the beginning of data analysis, the researcher
creates a template, which is an a priori codebook for organizing the information.
For further details on NUDIST software you may like to consult the following:
Kelle, E. (ed.), Computer Aided Qualitative Data Analysis, Thousand Oaks, CA: Sage, 1995.
Tesch, R., Qualitative Research: Analysis Types and Software Tools, Bristol, PA: Falmer, 1990.
EVIEWS stands for Econometric Views. It is the current version of a statistical package for
manipulating time series data; it was originally the Time Series Processor (TSP) software
for large mainframe computers. As an econometric package, EVIEWS provides data
analysis, regression and forecasting tools. EVIEWS can be used for multipurpose
analytics, but this introduction will focus on econometric analysis of financial time series.
Once you are familiar with EVIEWS, the program is very user friendly.
In this section, we will describe how to create a new work file and import data into
EVIEWS. The various ways of handling the data into the work file are as follows:
Before working on any analysis, one must first create a so-called workfile, which must be
of exactly the same size and type as the data you would like to work with. After the workfile is
created, EVIEWS will let you import data into it from Excel, Lotus, ASCII (text) files,
etc. Data from other software packages such as SAS, SPSS, M-FIT, RATS, etc. cannot be
directly imported into EVIEWS.
To create a workfile click File →New → Workfile and the following dialog box will
appear.
If one is working with time series data, then one needs to know the frequency of the data
(daily, weekly, monthly, annual, etc.) as well as the start and end dates of the data. In the
case of cross-sectional data, one needs the total number of observations: choose undated or
irregular and enter the start observation and the end observation in the appropriate text boxes.
Let us take an example where time series data are imported from an Excel file using the
import function.
We have created a data file in Excel. The Excel file has been saved in the path ….. The
screenshot of the Excel file is as follows:
Now the following five-step procedure should be used for importing time series data into the
EVIEWS software.
The example has daily (5-day week) data with a start date of April 1991 and an end date
of Dec 2008. The data start at cell B2 in the Excel sheet called Sheet 1. There are 16
variables. Some of them have very long names, and it will be good to make them shorter,
which will make it easier to import the data into the EVIEWS workfile and to work with
them later.
As noted in the above workfile, the range as well as the sample is the period between 1st
Jan 1998 and 7th July 2009. There are always two default series, C and RESID. C is the
series that will contain the coefficients from the last regression equation that one has
estimated. RESID is the series that will contain the residuals from the last estimated
model.
3. Click Procs → Import → Read Text-Lotus-Excel. In the Open dialog box,
choose the Excel format and browse for the file. Select the file and open it. It is
important to mention here that one should close the Excel file before trying to
import it into EVIEWS; otherwise there will be an error message.
4. A dialog box now appears in which it is very crucial to enter the correct
information. Any mistake could result in an incomplete or even wrong dataset.
This is where our earlier check of the Excel file becomes very important. In
this example the dialog box should be filled out as follows:
The order of the data is by observation, with series in columns. The upper-left data cell is B2
and the sheet name is Sheet 1. If there is only a single sheet (as in this example) it is not
necessary to name it.
The names of the series/variables have been changed (notice that no spaces are allowed
in the names) in order to make them easier to work with. However, if you would like to
import the names given in the Excel file, you simply enter the number of variables
(in this case 8). These names can then be changed in EVIEWS using the Rename
function. However, using this method can cause problems if, for example, the names start
with a number and are very similar (e.g. names such as 30 day return, 5 day price change,
etc.).
The sample to import is taken from the workfile. Here it is possible to exclude periods,
which can be useful in case you would like to get rid of any outliers.
The workfile contains a list of the 9 imported variables in alphabetical order, as well as the
two columns for the estimated coefficients and the residuals. It is always better to check
whether all the variables have been imported correctly: common import errors include rows
with “NA” and numbers that are implausibly high or low.
Another useful approach is to open a set of variables, or all the variables, as a group. One
can do this with the following steps:
o clicking the first variable (of your choice)
o holding down the [Ctrl] key and clicking the other variables (in any order)
o clicking View → open as one window → open group
or simply right-clicking or double-clicking on any of the selected series
and then clicking open group
5. If you are certain that you have imported the data correctly into the EVIEWS
workfile, you can now save the workfile by clicking File → Save As. The workfile
will be saved in EVIEWS’ own Wf1 format. A saved workfile can be opened later
by selecting File → Open → Workfile from the main menu.
While working on time series data, it is often very useful to transform the existing
variables, to take care of scale and size and for other purposes. This can be done in
EVIEWS using the [Genr] button in the top right-hand corner of your workfile. For
example, suppose one would like to work with stock returns and has imported stock price
data into the EVIEWS workfile. The stock return can be generated with an equation such as
RET_STOCKA = DLOG(STOCK_A)
This will create the variable RET_STOCKA and include it in the workfile. You can view
the returns by double-clicking the variable.
Apart from DLOG, there are naturally a number of other mathematical functions, as well
as simple addition, subtraction, division and multiplication. One can also generate price
differences and lagged variables, which are frequently needed in time series work (for
example, for unit root tests). A one-period lag can be created by entering the following
equation after clicking the [Genr] button:
Lag = Stock_A(-1)
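What DLOG and the (-1) lag operator compute can be mimicked in a short script. The sketch below is in pure Python (not EVIEWS syntax) with made-up prices, purely to make the transformations concrete:

```python
import math

# DLOG(y): first difference of the natural log, i.e. the continuously
# compounded return log(y_t) - log(y_{t-1}).
def dlog(series):
    return [None] + [math.log(b) - math.log(a)
                     for a, b in zip(series, series[1:])]

# y(-k): shift the series back by k periods, padding the start with None
# (EVIEWS would show NA for the unavailable observations).
def lag(series, k=1):
    return [None] * k + series[:-k]

prices = [100.0, 105.0, 103.0]       # hypothetical stock prices
returns = dlog(prices)               # [None, log(105/100), log(103/105)]
lagged = lag(prices)                 # [None, 100.0, 105.0]
```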
Any graph or equation output can easily be copied into a Word document. To copy a
table, simply select the area you want to copy and click Edit → Copy. A dialog box
should appear, where you would usually select the first option: Formatted – copy numbers
as they appear in the table. Then go to Word/Excel, paste the selected area and
change the size of the output until it suits your document.
To copy a graph, click on it and a blue border should appear; then click Edit → Copy. In
the dialog box that appears, click Copy to clipboard and then paste into Word/Excel. Again,
the size can be adjusted as required.
EVIEWS can be used for examining the data in a variety of ways. This is demonstrated
as follows:
If you want to select a few variables and display a line graph of each of the series, you
can follow the given example based on the previously mentioned EVIEWS workfile.
In this example, we want to view four time series in the workfile: BSE 100, REER,
NEER and SENSEX. The procedure is to highlight the four variables (using the mouse and
[Ctrl] key) followed by a double or right click. Then you click Open Group and click the
[View] button in the spreadsheet that appears. From this menu, you can click Multiple
Graphs → Line, and the four line graphs will be displayed as shown below. As one can see,
there are other choices of graphs as well. In general, clicking the [View] button mentioned
above offers you many options for viewing your selected data.
If you would like to save the output in your workfile for later use, you first click the
[Freeze] button. In the new window which appears, click the [Name] button. In the dialog
box, enter a name for the output and click OK. Now the output appears with a graph
icon in your workfile.
The procedure for scatter plots is similar to the one just mentioned, but with scatter plots
one can use other options as well, for example scatter plots in connection with a
regression model. An example from our previously mentioned workfile is as follows:
One can obtain the descriptive statistics and histogram of a series by double-clicking the
series in the workfile. In the spreadsheet that appears, click the [View] button and
choose Descriptive Statistics → Histogram and Stats.
If you want to obtain descriptive statistics for several series at a time instead, highlight
the relevant series (using the mouse and the [Ctrl] key), double or right click
and choose Open Group. In the spreadsheet that appears, click the [View] button and
choose Descriptive Statistics → Individual Samples. This procedure will not give you
the histograms, however. The descriptive statistics and histogram of one time series
variable (SENSEX) are as follows:
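The core numbers in that output are straightforward to compute directly. As an illustrative sketch (pure Python, made-up data; EVIEWS additionally reports skewness, kurtosis and the Jarque-Bera statistic):

```python
# A few of the summary statistics shown under Histogram and Stats:
# mean, sample standard deviation, minimum and maximum.
def describe(series):
    n = len(series)
    mean = sum(series) / n
    # sample variance uses n - 1 in the denominator
    var = sum((x - mean) ** 2 for x in series) / (n - 1)
    return {"mean": mean, "sd": var ** 0.5,
            "min": min(series), "max": max(series)}

stats = describe([1, 2, 3, 4, 5])   # hypothetical observations
```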
The easiest way to display correlation and covariance matrices is to highlight the relevant
series (using the mouse and [Ctrl] key) and then click Quick → Group Statistics
→ Correlations (or Covariances if you want the covariance matrix). This creates a new
group and produces a common-sample correlation/covariance matrix. If a pairwise
correlation/covariance matrix is more suitable, it is produced by clicking the [View] button
and choosing Correlations (or Covariances) → Pairwise Samples. An example
of this is given below.
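Each entry of such a correlation matrix is the Pearson coefficient for one pair of series. A minimal sketch of that computation in pure Python (illustrative data, not EVIEWS output):

```python
# Pearson correlation: covariance of the pair divided by the product
# of the two standard deviations; always lies between -1 and +1.
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Perfectly positively related made-up series
r = pearson([1, 2, 3], [2, 4, 6])   # 1.0
```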
Before undertaking any time series econometric analysis of the data, it is of utmost
importance to deseasonalize the data, that is, to remove the seasonal fluctuations, if the
frequency of the time series is quarterly, monthly, etc. This is one of the basic preliminaries
of time series econometrics. Several methods are available to remove the seasonal
fluctuations, including the following:
Census X12
X11 (Historical) method
Moving average method
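The moving average method starts from a centered moving average that estimates the trend-cycle; dividing the original series by it isolates the seasonal component. The following pure-Python sketch shows the centered 4-quarter moving average step for quarterly data (illustrative values only; EVIEWS performs the full ratio-to-moving-average adjustment):

```python
# Centered 4-term moving average for quarterly data: because 4 is even,
# the average of two adjacent 4-quarter windows is taken so that the
# result is centered on an actual quarter. Undefined at the two
# observations on each end, where None is returned (NA in EVIEWS).
def centered_ma4(y):
    out = [None] * len(y)
    for t in range(2, len(y) - 2):
        out[t] = (0.5 * y[t - 2] + y[t - 1] + y[t]
                  + y[t + 1] + 0.5 * y[t + 2]) / 4
    return out

# On a pure linear trend the moving average reproduces the trend exactly
trend = centered_ma4([10, 20, 30, 40, 50, 60, 70, 80])
```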
To perform a seasonal adjustment, you can select any variable from the workfile and open
its data by double-clicking on it. Then, by clicking on [Procs], you can find all the
above-mentioned methods for checking and removing the seasonality of the series. An
example of the seasonality test is as follows:
In the following illustration, we will demonstrate how to estimate a regression model in
EVIEWS. When you have opened your workfile, click on the [Objects] button. Select
New Object → Equation and the following dialog box will appear.
Say we want to estimate a regression equation relating stock prices and exchange rates. In
the example below, SENSEX is the dependent variable and NEER and REER are the
independent variables. You can enter the model in two ways. In the first, you list the
dependent variable, followed by C for the intercept term and then the independent
variable(s). There must be a single space between variables, so we will enter the following
regression into the first window.
After you have entered the equation, select the estimation method and your sample
period. Then just click OK to get the following output.
The non-stationary nature of most time series data, and the need to avoid the problem
of spurious or nonsense regression, call for an examination of their stationarity. In brief,
variables whose mean, variance and autocovariances (at various lags) change over time
are said to be nonstationary time series, or unit root4 variables. Conversely, a time series
is stationary if its mean, variance and autocovariances (at various lags) are
time-independent.
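The textbook example of a nonstationary series is the random walk, whose first difference is stationary. The simulation below (pure Python, fixed seed, purely illustrative) builds a random walk by accumulating white-noise shocks and then shows that differencing recovers those stationary shocks:

```python
import random

# A random walk Y_t = Y_{t-1} + e_t accumulates every past shock, so its
# variance grows with t (nonstationary). Its first difference is just
# e_t again, which is stationary white noise.
random.seed(42)                      # fixed seed for reproducibility
shocks = [random.gauss(0, 1) for _ in range(200)]

walk, level = [], 0.0
for e in shocks:
    level += e                       # Y_t = sum of shocks up to t
    walk.append(level)

# Delta Y_t = Y_t - Y_{t-1} undoes the accumulation
diffs = [walk[0]] + [b - a for a, b in zip(walk, walk[1:])]
```

This is why, in the unit root testing that follows, a series found to be nonstationary in levels is usually re-examined in first differences.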
Dickey and Fuller (1979) consider three different regression equations that can be used
to test the presence of a unit root:
In the above specifications, the difference among the three regressions concerns the presence
of the deterministic elements a0 and a2t. The first is a pure random walk model; in the second
an intercept or drift term has been added; and the third equation includes both a drift and a
linear time trend. The parameter of interest in all the regression equations is γ: if γ = 0,
the {Yt} sequence contains a unit root. The test involves estimating one or more of the
equations above using OLS in order to obtain the estimated value of γ and its associated
standard error. Comparing the resulting t-statistic with the appropriate value reported in
the Dickey-Fuller tables allows one to determine whether to accept or reject the null
hypothesis γ = 0.
In conducting the Dickey-Fuller test as in equations 1, 2 and 3, it was assumed that the
error term εt was uncorrelated. For the case where the error term εt is autocorrelated, Dickey
and Fuller developed a test known as the Augmented Dickey-Fuller (ADF) test.
4. The term unit root refers to the root of the polynomial in the lag operator.
ΔYt = a0 + a1t + γYt−1 + Σ(i=1 to k) βi ΔYt−i + εt …(3.1)
where εt is a pure white noise error term, Δ is the difference operator, and γ and βi are the
parameters. In the ADF test we still test whether γ = 0, and the ADF test follows the same
asymptotic distribution as the DF statistic, so the same critical values can be used.
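The mechanics of the simplest case can be sketched directly. The pure-Python illustration below estimates γ by OLS for the first Dickey-Fuller equation (the pure random walk case, with no drift, trend or augmentation terms); the data are made up, and a real test would also require the standard error of γ and the DF critical values:

```python
# OLS estimate of gamma in: Delta Y_t = gamma * Y_{t-1} + e_t
# (regression through the origin, so gamma_hat is just
#  sum(Delta Y_t * Y_{t-1}) / sum(Y_{t-1}^2)).
def df_gamma(y):
    dy = [b - a for a, b in zip(y, y[1:])]   # Delta Y_t
    ylag = y[:-1]                            # Y_{t-1}
    return (sum(d * l for d, l in zip(dy, ylag))
            / sum(l * l for l in ylag))

# Illustrative explosive series Y_t = 2 * Y_{t-1}, so Delta Y_t = Y_{t-1}
g = df_gamma([1, 2, 4, 8, 16])   # gamma_hat = 1.0
```

The t-ratio of this estimate would then be compared with the Dickey-Fuller critical values rather than the usual Student-t values, which is what EVIEWS does behind the Unit Root Test dialog.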
This test is quickly done in EVIEWS by double-clicking the relevant time series to go to the
spreadsheet view. Here, click the [View] button and select Unit Root Test. Alternatively,
you can click the [Quick] button, select the series and then Unit Root Test. The following
dialog box should appear.
In the above dialog box, you first have to select the test type. Next, select whether you
want to test for a unit root at the level, first difference or second difference. Finally, you can
choose whether to perform the unit root test with none, with intercept, or with trend and
intercept (these correspond to the equations explained above).
Now we will examine whether a time series is stationary or not. In the following example
we have examined the unit root of the stock index (SENSEX), both at level and at first
difference, using the Augmented Dickey-Fuller test. The results are as follows:
From the above result, the ADF test statistic (-4.083999) is greater in absolute value than the
critical values at all significance levels. Hence the null hypothesis of a unit root is rejected
and SENSEX is stationary at its first difference.
Granger causality may be defined as the forecasting relationship between two variables,
proposed by Granger (1969) and popularised by Sims (1972). In brief, if S and E are two
time series variables and past values of the variable S contribute significantly to forecasting
the value of the other variable E, then S is said to Granger-cause E, and vice versa. The test
involves the following two regression equations:
St = γ0 + Σ(i=1 to n) αi Et−i + Σ(j=1 to n) βj St−j + u1t …(4)
Et = γ1 + Σ(i=1 to m) λi Et−i + Σ(j=1 to m) δj St−j + u2t …(5)
where St and Et are the stock price and exchange rate to be tested, u1t and u2t are
mutually uncorrelated white noise errors, and t denotes the time period. Equation (4)
postulates that current S is related to past values of S as well as of E; similarly,
equation (5) postulates that E is related to past values of E and S. The null hypothesis for
equation (4) is that there is no causation from E to S, i.e. the coefficients on the lagged E
are not significant: Σ(i=1 to n) αi = 0. Similarly, the null hypothesis for equation (5) is that
there is no causation from S to E: Σ(j=1 to m) δj = 0. Three possible conclusions can be
drawn from such an analysis: unidirectional causality, bidirectional causality (feedback),
and independence (no causality in either direction).
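The logic behind testing these restrictions can be sketched outside EVIEWS. The pure-Python illustration below (made-up data) drops the intercept and uses a single lag so the unrestricted model reduces to a 2×2 system of normal equations; a real Granger test would include an intercept, choose the lag length, and convert the two residual sums of squares into an F statistic:

```python
# Compare a restricted model S_t = beta*S_{t-1} with an unrestricted
# model S_t = beta*S_{t-1} + alpha*E_{t-1}, both without intercept.
# If adding lagged E barely reduces the residual sum of squares,
# E does not help forecast S (no Granger causality from E to S).
def granger_sketch(s, e):
    y = s[1:]
    x1, x2 = s[:-1], e[:-1]              # S_{t-1} and E_{t-1}
    # restricted model: regression through the origin on S_{t-1} alone
    b_r = sum(a * yi for a, yi in zip(x1, y)) / sum(a * a for a in x1)
    ssr_r = sum((yi - b_r * a) ** 2 for a, yi in zip(x1, y))
    # unrestricted model: solve the 2x2 normal equations (Cramer's rule)
    s11 = sum(a * a for a in x1)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s22 = sum(b * b for b in x2)
    c1 = sum(a * yi for a, yi in zip(x1, y))
    c2 = sum(b * yi for b, yi in zip(x2, y))
    det = s11 * s22 - s12 * s12
    beta = (c1 * s22 - s12 * c2) / det
    alpha = (s11 * c2 - s12 * c1) / det
    ssr_u = sum((yi - beta * a - alpha * b) ** 2
                for a, b, yi in zip(x1, x2, y))
    return ssr_r, ssr_u, beta, alpha

# Data built so that S_t = 0.5*S_{t-1} + 2*E_{t-1} holds exactly:
# the unrestricted fit recovers those coefficients with SSR near zero.
e = [1, 2, 1, 2, 1, 2]
s = [4, 4.0, 6.0, 5.0, 6.5, 5.25]
ssr_r, ssr_u, beta, alpha = granger_sketch(s, e)
```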
The Granger causality test can easily be performed in EVIEWS. In the workfile, select the
group of variables of your choice, click the [View] menu and select Granger Causality. The
result of the Granger causality test should appear as follows:
By its very construction, a VAR system consists of a set of variables, each of which is
related to lags of itself and of all other variables in the system. In other words, a VAR
system consists of a set of regression equations, each of which has an adjustment
mechanism such that even small changes in one variable in the system may be
accounted for automatically by adjustments in the rest of the variables. Thus, a VAR
provides a fairly unrestricted approximation to a reduced-form structural model without
assuming beforehand that any of the variables is exogenous. By avoiding the imposition
of a priori restrictions on the model, the VAR adds significantly to its flexibility.
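The mechanics of "each variable related to lags of itself and of all other variables" can be made concrete with a small sketch. The pure-Python illustration below computes one step of a bivariate VAR(1); the coefficient matrix and intercepts are made up for illustration, not estimated:

```python
# One step of a VAR(1): y_t = c + A y_{t-1}, where each row i of A holds
# the coefficients of variable i on the lag of every variable in the
# system (itself included).
def var1_step(y_prev, A, c):
    return [c[i] + sum(A[i][j] * y_prev[j] for j in range(len(y_prev)))
            for i in range(len(c))]

# Hypothetical 2-variable system: each variable depends on its own lag
# and on the lag of the other variable.
A = [[0.5, 0.1],
     [0.2, 0.4]]
c = [1.0, 2.0]
y_next = var1_step([10.0, 10.0], A, c)   # next-period values
```

Iterating this step forward is exactly how a VAR produces forecasts once the coefficients have been estimated equation by equation.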
Steps of VAR:
After setting the lag length, we are in a position to estimate the model. It may be noted,
however, that the coefficients obtained from the estimation of a VAR model cannot be
interpreted directly. To overcome this problem, Litterman (1979) suggested the use
of innovation accounting techniques, which consist of impulse response functions
(IRFs) and variance decompositions (VDs).
The impulse response function is used to trace out the dynamic interaction among
variables. It shows the dynamic response of all the variables in the system to a shock or
innovation. For computing the IRFs, it is essential that the variables in the system are
ordered.
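For a VAR(1) with no deterministic terms, the response at horizon h is simply the coefficient matrix applied h times to the initial shock vector. The sketch below (pure Python, made-up coefficients; it ignores the shock-ordering/orthogonalization step that EVIEWS applies) illustrates that recursion:

```python
# Impulse responses of y_t = A y_{t-1}: the response at horizon h to an
# initial shock vector is A applied h times to that shock.
def irf(A, shock, horizons):
    responses = [shock]
    for _ in range(horizons):
        prev = responses[-1]
        responses.append([sum(A[i][j] * prev[j] for j in range(len(prev)))
                          for i in range(len(A))])
    return responses

# Hypothetical stable system: a unit shock to the first variable
# decays geometrically and never spills over to the second.
paths = irf([[0.5, 0.0], [0.0, 0.5]], [1.0, 0.0], 2)
```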
Variance Decomposition:
Variance decomposition is used to detect the causal relations among the variables. It
explains the extent to which a variable is explained by the shocks to all the variables in
the system. The forecast error variance decomposition gives the proportion of the
movements in a sequence that is due to its own shocks versus shocks to the other variables.
Now we can take an example of VAR modelling between stock prices and exchange rates.
Let us say we have considered SENSEX and BSE 100 to represent the stock market, and
NEER and REER to represent the effective exchange rates. In the EVIEWS workfile,
select and open the group of variables. Then click [Quick] and select Estimate VAR. The
following dialog box should appear. Enter all the variables
under endogenous variables. Choose the optimum lag length as per the lag selection
criteria (refer to the next section) and click OK.
After clicking OK, the following output will be generated. As mentioned above, after
generating this output, click [View] and select Lag Structure and then Lag Length
Criteria. This is given as follows:
Then the following output on lag length will be generated for the various statistical lag
selection criteria. One can choose the optimum lag length as per any given criterion.
For the impulse response function, you can click [View] and select Impulse Response.
Then the following dialog box should appear.
Now you can select the display format of the output, either Table or Graph. Then select
the impulse and response variables, along with the number of periods ahead for
forecasting. While doing so, the following dialog box should appear.
In the above dialog box, we have selected the output in multiple graphs format. By
clicking OK, the following output will be generated.
In a similar way, you can generate the variance decomposition output, which is shown
as follows.