Defining and Collecting Data

Statistics for Managers
AN IIM ALUMNI Venture
P.A.C.E
Shaping Careers in Finance
Agenda
 DCOVA framework for decision making
 Understanding variable types (Define Task)
 Measurement scales for variables
 Data Collection
 Data Sources
 Populations and Samples
 Data Cleaning
 Recoding Variables
 Types of Sampling Methods
 Commonly used probability samples
 Types of Survey Errors
 Ethical issues and sampling errors
DCOVA framework for decision making
 Define the variables that you want to study in order to solve a problem or meet an objective
 Collect the data for those variables from appropriate sources
 Organize the data (Tables)
 Visualize the data (Charts)
 Analyze the data  Reach conclusions and present those results
Understanding variable types (Define Task)
 Statistical methods to use vary according to type of the variable
 Categorical Variables (Qualitative Variables)
 Have values that can only be placed into categories
 Numerical Variables (Quantitative Variables)
 Have values that represent quantities
 Discrete Variables  Have numerical values that arise from a counting process (Integers)
 Continuous Variables  Produce numerical responses that arise from a measuring process (any value within a
continuum, or an interval)
 Some variables can be either categorical or numerical, depending on how you define them (Age)
Measurement scales for variables
 Nominal and Ordinal scales
 Describe the values for a categorical variable
 Nominal Scale  classifies data into distinct categories in which no ranking is implied (Gender, State etc.)  weakest
form of measurement
 Ordinal Scale  classifies values into distinct categories in which ranking is implied (Ratings, Grade etc.)  stronger
form of measurement than nominal scaling  scale does not account for the amount of the differences between the
categories (by how much)
 Interval and Ratio scales
 Describe the values for a numerical variable
 Interval Scale  Ordered scale in which the difference between measurements is a meaningful quantity but does not
involve a true zero point (Temperature)
 Ratio Scale  Ordered scale in which the difference between the measurements involves a true zero point (Height)
 Stronger forms of measurement than an ordinal scale
Data Collection
 Garbage in – Garbage out
 Tasks in Data Collection
 Identifying data sources
 Deciding whether the data you collect will be from a population or a sample
 Data Cleaning
 Recoding the variables if required
Data Sources
 Primary Data Source
 You collect your own data for analysis
 Secondary Data source
 Data for your analysis have been collected by someone else
 Modes of Data Collection
 Data distributed by an organization or individual (Market research companies, Investment Services firms, print and
online media companies)
 Outcomes of a designed experiment (conduct an experiment that compares features)
 Survey responses (questions about their beliefs, attitudes, behaviors, and other characteristics)  Can be affected by
errors
 Observational study  Collect data by directly observing a behavior, usually in a natural or neutral setting
 Focus groups to elicit unstructured responses to open-ended questions
 Used to enhance teamwork or improve the quality of products and services
 Data collected by ongoing business activities  collected from operational & transactional systems (Online or offline)
Populations and Samples
 Population
 Consists of all the items or individuals about which you want to reach conclusions
 Sample
 Portion of a population selected for analysis
 Results of analyzing a sample are used to estimate characteristics of the entire population
 Advantages of using a sample
 Less time-consuming
 Less costly
 Less cumbersome and more practical
Data Cleaning
 There may be irregularities in the data you collect
 Common Data Problems
 Typographical or data entry errors
 Values that are impossible or undefined
 Values that are Missing
 Outliers for numeric values  values that seem excessively different from most of the rest of the values (may or may
not be errors)
 More sophisticated statistical programs have provisions to process data that contain occasional missing values
(EXCEL does not)
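The checks above can be sketched in a few lines of Python. This is only an illustration: the `raw_ages` values, the plausible age range of 0 to 120, and the 1.5 × IQR outlier rule are assumptions chosen for the example, not part of the course material.

```python
import statistics

# Hypothetical raw survey values for an "age" variable (illustrative only).
raw_ages = ["34", "29", "", "41", "-5", "38", "3o", "36", "97", "33"]

missing, invalid, clean = [], [], []
for position, text in enumerate(raw_ages):
    if text.strip() == "":
        missing.append(position)          # value that is missing
        continue
    try:
        value = int(text)
    except ValueError:
        invalid.append(position)          # typographical or data entry error
        continue
    if not 0 <= value <= 120:             # assumed plausible range for age
        invalid.append(position)          # impossible value
        continue
    clean.append(value)

# Flag possible outliers with the 1.5 * IQR rule (one common convention).
q1, _, q3 = statistics.quantiles(clean, n=4)
iqr = q3 - q1
outliers = [v for v in clean if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]

print("missing positions:", missing)
print("invalid positions:", invalid)
print("possible outliers:", outliers)
```

A flagged outlier is only a candidate for review; as noted above, it may or may not be an error.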
Recoding Variables
 Need to reconsider the categories that you have defined for a categorical variable
 Need to transform a numerical variable into a categorical variable by assigning the individual numeric data
values to one of several groups
 Ensure Mutual Exclusivity
 Category definitions cause each data value to be placed in one and only one category
 Ensure the recoding is collectively exhaustive
 Set of categories you create for the new, recoded variables include all the data values being recoded
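A minimal sketch of recoding a numerical variable into categories follows. The age groups and their boundaries are hypothetical; the point is that half-open ranges (lower ≤ age < upper) keep the categories mutually exclusive, and an open-ended last group keeps them collectively exhaustive for non-negative ages.

```python
# Hypothetical age groups; the boundaries are illustrative assumptions.
# Each group covers lower <= age < upper, so no value fits two groups.
AGE_GROUPS = [
    (0, 18, "Under 18"),
    (18, 35, "18 to 34"),
    (35, 55, "35 to 54"),
    (55, float("inf"), "55 and over"),
]

def recode_age(age):
    """Return the single category label that contains `age`."""
    matches = [label for lower, upper, label in AGE_GROUPS if lower <= age < upper]
    if len(matches) != 1:
        # Zero matches: not collectively exhaustive; several: not mutually exclusive.
        raise ValueError(f"age {age!r} falls in {len(matches)} categories")
    return matches[0]

ages = [17, 18, 34, 35, 54, 55, 90]
recoded = [recode_age(a) for a in ages]
print(recoded)
```

Raising an error when a value matches zero or several categories makes a faulty category definition visible instead of silently misclassifying data.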
Types of Sampling Methods
 Define a frame
 Complete or partial listing of the items that make up the population from which the sample will be selected
 Inaccurate or biased results can occur if a frame excludes certain groups, or portions of the population
 Select either a nonprobability sample or a probability sample
 Nonprobability sample  Select the items or individuals without knowing their probabilities of selection
 Convenience, speed, and low cost
 Used for small scale pilot analysis
 Convenience Sample  select items that are easy, inexpensive, or convenient to sample (self-selected website visitors for web
surveys)
 Judgement Sample  Collect the opinions of preselected experts in the subject matter  Cannot generalize their results to the
population
 Probability Sample  select items based on known probabilities (allow you to make inferences about the population
being analyzed)
 Theory of statistical inference depends on probability sampling
Commonly used probability samples
 Simple Random Sample
 Every item from a frame has the same chance of selection as every other item
 Every sample of a fixed size has the same chance of selection as every other sample of that size
 Most elementary random sampling technique
 Sampling with replacement  After you select an item, you return it to the frame, where it has the same probability of
being selected again  Repeat this process until you have selected the desired sample size
 Sampling without replacement  Once you select an item, you cannot select it again
 Table of random numbers can be used for selection
 Systematic sample
 Partition the N items in the frame into n groups of k items, where k = N/n (round k to the nearest integer)
 Choose the first item to be selected at random from the first k items in the frame
 Select the remaining n – 1 items by taking every kth item thereafter from the entire frame
 Prone to selection bias that can occur when there is a pattern in the frame
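The simple random and systematic procedures above might look like this in Python, using the standard `random` module in place of a table of random numbers. The frame of 100 item IDs, the sample size of 10, and the fixed seed are assumptions made for the sketch.

```python
import random

rng = random.Random(42)            # fixed seed so the sketch is repeatable
frame = list(range(1, 101))        # hypothetical frame of N = 100 item IDs
n = 10                             # desired sample size

# Simple random sample WITH replacement: an item may appear more than once.
with_replacement = rng.choices(frame, k=n)

# Simple random sample WITHOUT replacement: each item appears at most once.
without_replacement = rng.sample(frame, k=n)

# Systematic sample: k = N / n, random start among the first k items,
# then every kth item after that.
k = round(len(frame) / n)
start = rng.randrange(k)           # index 0 .. k-1 within the first k items
systematic = frame[start::k][:n]

print(with_replacement)
print(without_replacement)
print(systematic)
```

Because the systematic sample is fully determined by its starting point, any periodic pattern in the frame with period k carries straight into the sample, which is the selection bias mentioned above.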
Commonly used probability samples
 Stratified sample
 Subdivide the N items in the frame into separate subpopulations (Strata)  defined by some common characteristic,
such as gender
 Select a simple random sample within each of the strata and combine the results from the separate simple random
samples
 Ensures representation of items across the entire population
 Homogeneity of items within each stratum provides greater precision in the estimates of underlying population
parameters
 Cluster Sample
 Divide the N items in the frame into clusters that contain several items (naturally occurring designations, such as
counties)
 Take a random sample of one or more clusters and study all items in each selected cluster
 More cost-effective than simple random sampling especially if the population is spread over a wide geographic region
 Requires a larger sample size to produce results as precise as those from simple random sampling or stratified
sampling
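The difference between the two designs can be sketched as follows. The frame, the stratum labels, the cluster names, and the sample sizes (4 items per stratum, 2 clusters) are all made up for illustration.

```python
import random
from collections import defaultdict

rng = random.Random(7)             # fixed seed so the sketch is repeatable

# Hypothetical frame: (item_id, stratum, cluster). All values are made up.
frame = [(i, "F" if i % 2 else "M", f"county_{i % 5}") for i in range(1, 41)]

# Stratified sample: simple random sample inside each stratum, then combine.
strata = defaultdict(list)
for item in frame:
    strata[item[1]].append(item)
stratified = []
for members in strata.values():
    stratified.extend(rng.sample(members, k=4))   # 4 per stratum (assumed)

# Cluster sample: randomly pick whole clusters, then keep every item in them.
clusters = defaultdict(list)
for item in frame:
    clusters[item[2]].append(item)
chosen = rng.sample(sorted(clusters), k=2)        # 2 clusters (assumed)
cluster_sample = [item for name in chosen for item in clusters[name]]

print(len(stratified), "items from", len(strata), "strata")
print(len(cluster_sample), "items from clusters", chosen)
```

In the stratified design every stratum is represented by construction; in the cluster design only the chosen clusters contribute, which is why it tends to need a larger sample for the same precision.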
Types of Survey Errors
 Evaluate the purpose of the survey, why it was conducted, and for whom it was conducted
 Surveys that use nonprobability sampling methods are subject to serious biases that may make the results
meaningless
 Coverage Error (Having an adequate frame)
 Occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample
 Results in a selection bias
 Nonresponse Error (Not everyone is willing to respond to a survey)
 Failure to collect data on all items in the sample
 Results in a nonresponse bias
 Personal interviews and telephone interviews usually produce a higher response rate than do mail surveys, but at a higher
cost
 Sampling Error
 Reflects the variation from sample to sample, based on the probability of particular individuals or items being selected in the
particular samples
 Margin of error can be reduced by using larger sample sizes but at a higher cost
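To illustrate how sample size drives the margin of error, the sketch below uses the standard large-sample formula for a sample proportion, e = z·sqrt(p(1 − p)/n). The formula itself is not given in these slides, and it assumes simple random sampling; p = 0.5 and z = 1.96 (roughly 95% confidence) are illustrative choices.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Quadrupling the sample size only halves the margin of error.
for n in (100, 400, 1600):
    print(n, round(margin_of_error(0.5, n), 4))
```

The square root in the formula is why extra precision gets expensive: each halving of the margin of error requires four times as many responses.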
Types of Survey Errors
 Measurement Errors
 When surveys rely on self-reported information, the mode of data collection, the respondent to the survey, and/or the
survey itself can be possible sources of measurement error
 Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent on the mode
 Social desirability bias or cognitive/memory limitations of a respondent can affect the results
 Vague questions, double-barrelled questions that ask about multiple issues but require a single response, and
questions that ask the respondent to report something that occurs over time but fail to clearly define the time
period can all introduce measurement error
 Need to standardize survey administration and respondent understanding of questions
Ethical issues and sampling errors
 Coverage error  If particular groups or individuals are purposely excluded from the frame so that the
survey results are more favorable to the survey’s sponsor
 Nonresponse error  If the sponsor knowingly designs the survey so that particular groups or individuals
are less likely than others to respond
 Sampling error  if the findings are purposely presented without reference to sample size and margin of
error so that the sponsor can promote a viewpoint that might otherwise be inappropriate
 Measurement error
 Survey sponsor chooses leading questions that guide the respondent in a particular direction
 An interviewer, through mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer or
otherwise guides the respondent in a particular direction
 Respondent willfully provides false information
 Ethical issues also arise when the results of nonprobability samples are used to form conclusions about the
entire population
Summary
 DCOVA framework for decision making
 Understanding variable types (Define Task)
 Measurement scales for variables
 Data Collection
 Data Sources
 Populations and Samples
 Data Cleaning
 Recoding Variables
 Types of Sampling Methods
 Commonly used probability samples
 Types of Survey Errors
 Ethical issues and sampling errors
Organizing Data
Agenda
 Organizing Categorical Data
 Organizing Numeric data
 Frequency Distribution
Organizing Categorical Data
 Tally the values of a variable by categories and place the results in tables
 Summary Table
 Tallies the values as frequencies or percentages for each category
 Helps you see the differences among the categories by displaying the frequency, amount, or percentage of items in a
set of categories in a separate column
 Contingency Table
 Cross-tabulates the values of two or more categorical variables
 Helps you to study patterns that may exist between the variables
 Can be shown as a frequency, a percentage of the overall total, a percentage of the row total, or a percentage of the
column total
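Both table types can be built with a simple tally. In this sketch the survey records (gender and preferred shopping channel) are invented for illustration, and `collections.Counter` stands in for whatever spreadsheet or statistics tool is actually used.

```python
from collections import Counter

# Hypothetical survey records: (gender, preferred_channel). Made-up data.
records = [
    ("F", "Online"), ("M", "Store"), ("F", "Online"), ("M", "Online"),
    ("F", "Store"), ("M", "Store"), ("F", "Online"), ("M", "Online"),
]

# Summary table: frequency and percentage for one categorical variable.
channel_counts = Counter(channel for _, channel in records)
total = len(records)
summary = {c: (freq, 100 * freq / total) for c, freq in channel_counts.items()}

# Contingency table: cross-tabulate two categorical variables.
cells = Counter(records)
row_totals = Counter(gender for gender, _ in records)
# Each cell expressed as a percentage of its row total.
row_pct = {cell: 100 * freq / row_totals[cell[0]] for cell, freq in cells.items()}

print(summary)
print(dict(cells))
print(row_pct)
```

Dividing by the column totals or by the overall total instead of `row_totals` gives the other two percentage views listed above.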
Organizing Numeric data
 Create ordered arrays or distributions
 Common Analysis
 Analyze the numerical variables by groups that are defined by the values of a second, categorical variable
 Stacked Format
 All of the values for a numerical variable appear in one column and a second, separate column contains the
categorical values that identify to which subgroup each numerical value belongs
 Unstacked format
 Values for a numerical variable are divided by subgroup and placed in separate columns
 Ordered Array
 Arranges the values of a numerical variable in rank order, from the smallest value to the largest value
 Helps to get a better sense of the range of values in the data
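The three layouts above can be sketched with plain Python lists. The region labels and values are made-up data used only to show the shape of each format.

```python
from collections import defaultdict

# Stacked format: one column of values, one column of group labels (made-up data).
stacked = [("North", 42), ("South", 37), ("North", 55), ("South", 29), ("North", 48)]

# Unstacked format: one list ("column") of values per group.
unstacked = defaultdict(list)
for group, value in stacked:
    unstacked[group].append(value)

# Ordered array: all values ranked from smallest to largest.
ordered = sorted(value for _, value in stacked)
value_range = ordered[-1] - ordered[0]

print(dict(unstacked))
print(ordered, "range:", value_range)
```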
Frequency Distribution
 Tallies the values of a numerical variable into a set of numerically ordered classes
 Class Interval  each class groups a mutually exclusive range of values
 Each value can be assigned to only one class, and every value must be contained in one of the class intervals
 Creating a frequency distribution
 Consider how many classes would be appropriate for your data (5 to 15 is ideal)
 Determine a suitable width for each class interval (subtract the lowest value from the highest value and divide that
result by the number of classes)
 Establish proper and clearly defined class boundaries for each class
 Well-chosen class intervals lead to class midpoints that are simple to read and interpret
 Classes and EXCEL Bins
 Bins and classes are both ranges of values, but bins do not have explicitly stated intervals
 Each bin number explicitly states the upper boundary of its bin (lower boundaries are defined implicitly)
 First bin always has negative infinity as its lower boundary
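The steps for creating a frequency distribution can be sketched as below. The data values, the choice of 5 classes, and the decision to round the class width up to the next multiple of 5 are assumptions made for the example; classes are treated as "lower boundary up to but not including upper boundary".

```python
import bisect
import math

values = [12, 18, 21, 25, 29, 33, 34, 38, 41, 47, 52, 58]   # made-up data
num_classes = 5

# Raw width = (highest - lowest) / number of classes, rounded up to a
# convenient figure (here the next multiple of 5, an arbitrary choice).
raw_width = (max(values) - min(values)) / num_classes
width = math.ceil(raw_width / 5) * 5

# Start the first class at a round number at or below the lowest value.
start = (min(values) // width) * width
edges = [start + i * width for i in range(num_classes + 1)]

frequencies = [0] * num_classes
for v in values:
    index = bisect.bisect_right(edges, v) - 1   # class whose lower edge <= v
    frequencies[index] += 1

for lower, upper, freq in zip(edges, edges[1:], frequencies):
    print(f"{lower} but less than {upper}: {freq}")
```

Rounding the width from 9.2 up to 10 gives boundaries of 10, 20, 30 and so on, and midpoints of 15, 25, 35, which are easier to read than classes built on the raw width. With other data the rounded classes may need checking to confirm they still cover the highest value.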
Frequency Distribution
 Relative Frequency Distribution
 Presents the relative frequency, or proportion, of the total for each group that each class represents
 Equal to the number of values in each class divided by the total number of values
 The total of the relative frequency column must always be 1.00
 Percentage Distribution
 Presents the percentage of the total for each group that each class represents
 Proportion multiplied by 100%
 The total of the percentage column must always be 100
 Cumulative Distribution
 Way of presenting information about the percentage of values that are less than a specific amount
 Rows of a cumulative distribution do not correspond to class intervals (each row also includes all the rows above it)
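As a small worked example, the three distributions can be derived from one column of class frequencies. The frequencies and class boundaries here are hypothetical.

```python
from itertools import accumulate

# Hypothetical class frequencies for classes 10-20, 20-30, ..., 50-60.
upper_bounds = [20, 30, 40, 50, 60]
frequencies = [2, 3, 3, 2, 2]
total = sum(frequencies)

relative = [f / total for f in frequencies]          # proportions, sum to 1.00
percentage = [100 * r for r in relative]             # sum to 100
# Running total of frequencies, expressed as a percentage of all values.
cumulative_pct = [100 * c / total for c in accumulate(frequencies)]

for upper, cum in zip(upper_bounds, cumulative_pct):
    print(f"less than {upper}: {cum:.1f}%")
```

Each cumulative row is labelled by an upper boundary ("less than 30") and not by a class interval, because it counts everything in that class and in all the classes before it.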
Summary
 Organizing Categorical Data
 Organizing Numeric data
 Frequency Distribution
