Statistics for Managers

AN IIM ALUMNI Venture

P.A.C.E
Shaping Careers in Finance
Defining and collecting data

Agenda
 DCOVA framework for decision making
 Understanding variable types (Define Task)
 Measurement scales for variables
 Data Collection
 Data Sources
 Populations and Samples
 Data Cleaning
 Recoding Variables
 Types of Sampling Methods
 Commonly used probability samples
 Types of Survey Errors
 Ethical issues and sampling errors
DCOVA framework for decision making
 Define the variables that you want to study in order to solve a problem or meet an objective
 Collect the data for those variables from appropriate sources
 Organize the data (Tables)
 Visualize the data (Charts)
 Analyze the data to reach conclusions and present those results
Understanding variable types (Define Task)
 Statistical methods to use vary according to type of the variable
 Categorical Variables (Qualitative Variables)
 Have values that can only be placed into categories
 Numerical Variables (Quantitative Variables)
 Have values that represent quantities
 Discrete Variables: have numerical values that arise from a counting process (integers)
 Continuous Variables: produce numerical responses that arise from a measuring process (any value within a continuum, or an interval)
 Some variables can be either categorical or numerical, depending on how you define them (for example, age recorded in years or as an age group)
Measurement scales for variables
 Nominal and Ordinal scales
 Describe the values for a categorical variable
 Nominal Scale: classifies data into distinct categories in which no ranking is implied (gender, state, etc.); the weakest form of measurement
 Ordinal Scale: classifies values into distinct categories in which ranking is implied (ratings, grades, etc.); a stronger form of measurement than nominal scaling, but the scale does not account for the amount of the differences between the categories (by how much)
 Interval and Ratio scales
 Describe the values for a numerical variable
 Interval Scale: ordered scale in which the difference between measurements is a meaningful quantity but does not involve a true zero point (temperature in degrees Celsius or Fahrenheit)
 Ratio Scale: ordered scale in which the difference between measurements is a meaningful quantity and involves a true zero point (height)
 Stronger forms of measurement than an ordinal scale
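The difference between interval and ratio scales can be checked with a small calculation. The sketch below uses two illustrative temperatures and two illustrative heights (all values are assumptions chosen for the example) to show that differences stay consistent on an interval scale, while ratios are only stable when the scale has a true zero point.

```python
def celsius_to_fahrenheit(c: float) -> float:
    """Convert a Celsius reading to Fahrenheit."""
    return c * 9 / 5 + 32


# Two illustrative temperature readings (interval scale, no true zero).
low_c, high_c = 10.0, 20.0
low_f, high_f = celsius_to_fahrenheit(low_c), celsius_to_fahrenheit(high_c)

# Differences are meaningful: a 10-degree Celsius gap is always an 18-degree Fahrenheit gap.
diff_c = high_c - low_c
diff_f = high_f - low_f

# Ratios are not meaningful: the ratio depends on the unit chosen.
ratio_c = high_c / low_c
ratio_f = high_f / low_f

# Two illustrative heights (ratio scale, true zero): the ratio survives a unit change.
h1_cm, h2_cm = 90.0, 180.0
ratio_cm = h2_cm / h1_cm
ratio_in = (h2_cm / 2.54) / (h1_cm / 2.54)

print(f"Temperature ratio: {ratio_c:.2f} in Celsius vs {ratio_f:.2f} in Fahrenheit")
print(f"Height ratio: {ratio_cm:.2f} in cm vs {ratio_in:.2f} in inches")
```

The temperature ratio changes with the unit, so "twice as hot" is not a valid statement, whereas "twice as tall" holds in any unit of length.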
Data Collection
 Garbage in – Garbage out
 Tasks in Data Collection
 Identifying data sources
 Deciding whether the data you collect will be from a population or a sample
 Data Cleaning
 Recoding the variables if required
Data Sources
 Primary Data Source
 You collect your own data for analysis
 Secondary Data source
 Data for your analysis have been collected by someone else
 Modes of Data Collection
 Data distributed by an organization or individual (market research companies, investment services firms, print and online media companies)
 Outcomes of a designed experiment (conduct an experiment that compares features)
 Survey responses (questions about people's beliefs, attitudes, behaviors, and other characteristics); can be affected by errors
 Observational study: collect data by directly observing a behavior, usually in a natural or neutral setting
 Focus groups to elicit unstructured responses to open-ended questions
 Used to enhance teamwork or improve the quality of products and services
 Data collected by ongoing business activities: collected from operational and transactional systems (online or offline)
Populations and Samples
 Population
 Consists of all the items or individuals about which you want to reach conclusions
 Sample
 Portion of a population selected for analysis
 Results of analyzing a sample are used to estimate characteristics of the entire population
 Advantages of using a sample
 Less time-consuming
 Less costly
 Less cumbersome and more practical
Data Cleaning
 There may be irregularities in the data you collect
 Common Data Problems
 Typographical or data entry errors
 Values that are impossible or undefined
 Values that are missing
 Outliers for numeric values: values that seem excessively different from most of the rest of the values (may or may not be errors)
 More sophisticated statistical programs have provisions to process data that contain occasional missing values (Excel does not)
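A minimal sketch of how the common data problems above might be flagged before analysis. The age values, the 0 to 120 validity range, and the 1.5 x IQR outlier rule are all assumptions made for this example; the slides do not prescribe a specific outlier rule.

```python
import statistics

# Hypothetical survey ages; None marks a missing value.
ages = [34, 29, None, 41, -5, 38, 250, 36, 31, 95]

# Missing values: positions where nothing was recorded.
missing = [i for i, a in enumerate(ages) if a is None]

# Impossible values: assumed here to be anything outside 0..120 years.
impossible = [i for i, a in enumerate(ages) if a is not None and not 0 <= a <= 120]

# Keep only usable values before looking for outliers.
valid = [a for a in ages if a is not None and 0 <= a <= 120]

# Assumed outlier rule: more than 1.5 * IQR beyond the quartiles.
q1, _, q3 = statistics.quantiles(valid, n=4)
iqr = q3 - q1
outliers = [a for a in valid if a < q1 - 1.5 * iqr or a > q3 + 1.5 * iqr]

print("Missing at positions:", missing)
print("Impossible at positions:", impossible)
print("Possible outliers (review, do not delete blindly):", outliers)
```

Flagged outliers are only candidates for review, since an extreme value may or may not be an error.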
Recoding Variables
 Recode when you need to reconsider the categories that you have defined for a categorical variable
 Recode when you need to transform a numerical variable into a categorical variable by assigning the individual numeric data values to one of several groups
 Ensure the recoding is mutually exclusive
 Category definitions cause each data value to be placed in one and only one category
 Ensure the recoding is collectively exhaustive
 The set of categories you create for the new, recoded variable includes all the data values being recoded
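As an illustration of the two recoding rules, the sketch below turns a numerical age into a categorical age group. The function name `recode_age` and the group boundaries are hypothetical choices for this example. The chained `if` tests make the groups mutually exclusive, and the final `return` makes them collectively exhaustive for every non-negative age.

```python
def recode_age(age: int) -> str:
    """Assign an age in years to exactly one age group (hypothetical boundaries)."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "Under 18"
    if age < 35:
        return "18 to 34"
    if age < 55:
        return "35 to 54"
    # Catch-all group: guarantees every remaining value has a category.
    return "55 and over"


ages = [17, 18, 34, 35, 54, 55, 80]
groups = [recode_age(a) for a in ages]

for age, group in zip(ages, groups):
    print(age, "->", group)
```

Boundary values such as 18 and 35 are the ones worth checking, because overlapping definitions like "18 to 35" and "35 to 54" would place 35 in two categories.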
Types of Sampling Methods
 Define a frame
 Complete or partial listing of the items that make up the population from which the sample will be selected
 Inaccurate or biased results can occur if a frame excludes certain groups, or portions of the population
 Select either a nonprobability sample or a probability sample
 Nonprobability sample: select the items or individuals without knowing their probabilities of selection
 Advantages: convenience, speed, and low cost
 Used for small-scale pilot analysis
 Convenience Sample: select items that are easy, inexpensive, or convenient to sample (self-selected website visitors for web surveys)
 Judgement Sample: collect the opinions of preselected experts in the subject matter; the results cannot be generalized to the population
 Probability Sample: select items based on known probabilities (allows you to make inferences about the population being analyzed)
 Theory of statistical inference depends on probability sampling
Commonly used probability samples
 Simple Random Sample
 Every item from a frame has the same chance of selection as every other item
 Every sample of a fixed size has the same chance of selection as every other sample of that size
 Most elementary random sampling technique
 Sampling with replacement: after you select an item, you return it to the frame, where it has the same probability of being selected again; repeat this process until you have selected the desired sample size
 Sampling without replacement: once you select an item, you cannot select it again
 A table of random numbers can be used for selection
 Systematic sample
 Partition the N items in the frame into n groups of k items, where k = N/n (round k to the nearest integer)
 Choose the first item to be selected at random from the first k items in the frame
 Select the remaining n - 1 items by taking every kth item thereafter from the entire frame
 Prone to selection bias when there is a pattern in the frame
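The selection procedures above can be sketched with Python's standard `random` module in place of a table of random numbers. The frame of 800 item IDs, the sample size of 40, and the helper name `systematic_sample` are assumptions for illustration; the fixed seed is there only so the sketch is repeatable.

```python
import random

rng = random.Random(42)          # fixed seed, for repeatability only
frame = list(range(1, 801))      # hypothetical frame: N = 800 item IDs
n = 40                           # desired sample size

# Simple random sample WITH replacement: an item can be drawn more than once.
with_replacement = rng.choices(frame, k=n)

# Simple random sample WITHOUT replacement: each item appears at most once.
without_replacement = rng.sample(frame, n)


def systematic_sample(frame, n, rng):
    """Pick a random start among the first k items, then take every kth item."""
    k = round(len(frame) / n)    # k = N / n, rounded to the nearest integer
    start = rng.randrange(k)     # random position within the first k items
    return frame[start::k][:n]


systematic = systematic_sample(frame, n, rng)

print("Distinct items, with replacement:", len(set(with_replacement)))
print("Distinct items, without replacement:", len(set(without_replacement)))
print("First systematic picks:", systematic[:3])
```

If the frame is sorted in a way that repeats every k items (for example, one supervisor listed after every 19 staff), the systematic sample can pick up that pattern, which is the selection bias noted above.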
 Stratified sample
 Subdivide the N items in the frame into separate subpopulations (strata) defined by some common characteristic, such as gender
 Select a simple random sample within each of the strata and combine the results from the separate simple random samples
 Ensures representation of items across the entire population
 Homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters
 Cluster Sample
 Divide the N items in the frame into clusters that contain several items (naturally occurring designations, such as counties)
 Take a random sample of one or more clusters and study all items in each selected cluster
 More cost-effective than simple random sampling, especially if the population is spread over a wide geographic region
 Requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling
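A minimal sketch contrasting the two designs. The frame layout (item ID, gender stratum, county cluster), the sample sizes, and the helper names `stratified_sample` and `cluster_sample` are hypothetical. Stratified sampling draws a simple random sample from every stratum, while cluster sampling draws whole clusters and keeps every item in them.

```python
import random
from collections import defaultdict

rng = random.Random(7)  # fixed seed, for repeatability only

# Hypothetical frame of 100 items: (item_id, stratum, cluster).
frame = [(i, "F" if i % 2 == 0 else "M", f"county_{i % 5}") for i in range(100)]


def stratified_sample(frame, per_stratum, rng):
    """Simple random sample of `per_stratum` items from every stratum, combined."""
    strata = defaultdict(list)
    for item in frame:
        strata[item[1]].append(item)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, per_stratum))
    return sample


def cluster_sample(frame, n_clusters, rng):
    """Randomly choose whole clusters and keep all items in each chosen cluster."""
    clusters = defaultdict(list)
    for item in frame:
        clusters[item[2]].append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]


stratified = stratified_sample(frame, per_stratum=10, rng=rng)
clustered = cluster_sample(frame, n_clusters=2, rng=rng)

print("Stratified sample size:", len(stratified))
print("Cluster sample size:", len(clustered))
```

Every stratum is guaranteed to appear in the stratified sample, whereas the cluster sample covers only the counties that happened to be drawn.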
Types of Survey Errors
 Evaluate the purpose of the survey, why it was conducted, and for whom it was conducted
 Surveys that use nonprobability sampling methods are subject to serious biases that may make the results meaningless
 Coverage Error (having an adequate frame)
 Occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample
 Results in a selection bias
 Nonresponse Error (not everyone is willing to respond to a survey)
 Failure to collect data on all items in the sample
 Results in a nonresponse bias
 Personal interviews and telephone interviews usually produce a higher response rate than mail surveys, but at a higher cost
 Sampling Error
 Reflects the variation from sample to sample, based on the probability of particular individuals or items being selected in the particular samples
 Margin of error can be reduced by using larger sample sizes, but at a higher cost
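To make the trade-off between sample size and margin of error concrete, the sketch below uses the usual normal-approximation margin of error for a sample proportion, z * sqrt(p(1 - p) / n). This formula is not given in the slides and is introduced here as an assumption, along with the 95% critical value z = 1.96 and the worst-case proportion p = 0.5.

```python
import math


def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
    """Normal-approximation margin of error for a sample proportion (assumed formula)."""
    return z * math.sqrt(p * (1 - p) / n)


# Worst-case proportion p = 0.5 at three hypothetical sample sizes.
margins = {n: margin_of_error(0.5, n) for n in (100, 400, 1600)}

for n, moe in margins.items():
    print(f"n = {n:>4}: margin of error is about {moe:.2%}")
```

Under this formula the margin of error shrinks with the square root of n, so halving it requires roughly four times as many responses, which is where the higher cost comes from.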
 Measurement Errors
 When surveys rely on self-reported information, the mode of data collection, the respondent to the survey, and/or the survey itself can be possible sources of measurement error
 Satisficing, social desirability, reading ability, and/or interviewer effects can be dependent on the mode
 Social desirability bias or cognitive/memory limitations of a respondent can affect the results
 Vague questions, double-barrelled questions that ask about multiple issues but require a single response, or questions that ask the respondent to report something that occurs over time but fail to clearly define the extent of time can also cause measurement error
 Need to standardize survey administration and respondent understanding of questions
Ethical issues and sampling errors
 Coverage error becomes an ethical issue if particular groups or individuals are purposely excluded from the frame so that the survey results are more favorable to the survey's sponsor
 Nonresponse error becomes an ethical issue if the sponsor knowingly designs the survey so that particular groups or individuals are less likely than others to respond
 Sampling error becomes an ethical issue if the findings are purposely presented without reference to sample size and margin of error so that the sponsor can promote a viewpoint that might otherwise be inappropriate
 Measurement error becomes an ethical issue if:
 Survey sponsor chooses leading questions that guide the respondent in a particular direction
 An interviewer, through mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer or otherwise guides the respondent in a particular direction
 Respondent willfully provides false information
 Ethical issues also arise when the results of nonprobability samples are used to form conclusions about the entire population
Summary
 DCOVA framework for decision making
 Understanding variable types (Define Task)
 Measurement scales for variables
 Data Collection
 Data Sources
 Populations and Samples
 Data Cleaning
 Recoding Variables
 Types of Sampling Methods
 Commonly used probability samples
 Types of Survey Errors
 Ethical issues and sampling errors
Organizing Data

Agenda
 Organizing Categorical Data
 Organizing Numeric data
 Frequency Distribution
Organizing Categorical Data
 Tally the values of a variable by categories and place the results in tables
 Summary Table
 Tallies the values as frequencies or percentages for each category
 Helps you see the differences among the categories by displaying the frequency, amount, or percentage of items in a set of categories in a separate column
 Contingency Table
 Cross-tabulates the values of two or more categorical variables
 Helps you to study patterns that may exist between the variables
 Can be shown as a frequency, a percentage of the overall total, a percentage of the row total, or a percentage of the column total
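Both tables can be sketched with `collections.Counter`. The records below (fund type paired with risk level) are made-up values for illustration only. The summary table tallies one variable, and the contingency table tallies each pair of values, shown here as frequencies and as percentages of the row total.

```python
from collections import Counter

# Hypothetical records: (fund type, risk level).
records = [
    ("Growth", "Low"), ("Growth", "High"), ("Value", "Low"), ("Growth", "High"),
    ("Value", "Low"), ("Value", "High"), ("Growth", "Low"), ("Growth", "High"),
]

# Summary table: frequency and percentage for each category of one variable.
summary = Counter(fund for fund, _ in records)
total = sum(summary.values())
summary_pct = {fund: 100 * count / total for fund, count in summary.items()}

# Contingency table: frequency for each (row, column) combination.
contingency = Counter(records)

# Same table expressed as a percentage of the row total.
row_pct = {
    cell: 100 * count / summary[cell[0]] for cell, count in contingency.items()
}

for fund, count in summary.items():
    print(f"{fund}: {count} ({summary_pct[fund]:.1f}%)")
for (fund, risk), count in sorted(contingency.items()):
    print(f"{fund} / {risk}: {count} ({row_pct[(fund, risk)]:.1f}% of row)")
```

Dividing by the column total or the overall total instead of `summary[cell[0]]` gives the other two percentage views.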
Organizing Numeric data
 Create ordered arrays or distributions
 Common Analysis
 Analyze the numerical variables by groups that are defined by the values of a second, categorical variable
 Stacked Format
 All of the values for a numerical variable appear in one column and a second, separate column contains the categorical values that identify to which subgroup each numerical value belongs
 Unstacked format
 Values for a numerical variable are divided by subgroup and placed in separate columns
 Ordered Array
 Arranges the values of a numerical variable in rank order, from the smallest value to the largest value
 Helps to get a better sense of the range of values in the data
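The sketch below shows the same small data set in stacked and unstacked form, followed by an ordered array. The location labels and meal costs are hypothetical values chosen for the example.

```python
from collections import defaultdict

# Stacked format: one column of values, one column naming the subgroup.
stacked = [
    ("Center", 7.42), ("Metro", 6.10), ("Center", 5.35),
    ("Metro", 8.25), ("Center", 6.80),
]

# Unstacked format: one separate column (list) of values per subgroup.
unstacked = defaultdict(list)
for group, value in stacked:
    unstacked[group].append(value)

# Ordered array: all values ranked from smallest to largest.
ordered_array = sorted(value for _, value in stacked)
value_range = ordered_array[-1] - ordered_array[0]

print(dict(unstacked))
print(ordered_array)
print(f"Range: {value_range:.2f}")
```

Once the values are ordered, the smallest and largest are at the ends of the list, which is what makes the range easy to read off.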
Frequency Distribution
 Tallies the values of a numerical variable into a set of numerically ordered classes
 Class Interval: each class groups a mutually exclusive range of values
 Each value can be assigned to only one class, and every value must be contained in one of the class intervals
 Creating a frequency distribution
 Consider how many classes would be appropriate for your data (5 to 15 is ideal)
 Determine a suitable width for each class interval (subtract the lowest value from the highest value and divide that result by the number of classes)
 Establish proper and clearly defined class boundaries for each class
 Well-chosen class intervals lead to class midpoints that are simple to read and interpret
 Classes and Excel Bins
 Bins and classes are both ranges of values, but bins do not have explicitly stated intervals
 Each bin number explicitly states the upper boundary of its bin (lower boundaries are defined implicitly)
 First bin always has negative infinity as its lower boundary
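The steps for building a frequency distribution, and the way bins differ from classes, can be sketched as follows. The ten data values, the choice of 5 classes, the starting boundary of 10, and the bin numbers ending in .99 are all assumptions for illustration. The bin logic imitates the upper-boundary convention described above rather than calling Excel itself.

```python
import math
from bisect import bisect_left

values = [12, 18, 25, 31, 33, 40, 47, 52, 58, 59]  # hypothetical data
num_classes = 5

# Class width: (highest - lowest) / number of classes, rounded up to a convenient value.
raw_width = (max(values) - min(values)) / num_classes
width = math.ceil(raw_width)

# Assumed first lower boundary: a round number at or below the smallest value.
first_lower = 10

# Classes with explicit intervals: [10, 20), [20, 30), ..., [50, 60).
class_counts = [0] * num_classes
for v in values:
    class_counts[(v - first_lower) // width] += 1

# Bins state only an upper boundary; a value goes in the first bin whose
# upper boundary is greater than or equal to it.
bins = [19.99, 29.99, 39.99, 49.99, 59.99]
bin_counts = [0] * len(bins)
for v in values:
    bin_counts[bisect_left(bins, v)] += 1

for i, count in enumerate(class_counts):
    lower = first_lower + i * width
    print(f"{lower} but less than {lower + width}: {count}")
print("Bin counts:", bin_counts)
```

Because each bin includes its upper boundary, bin numbers such as 19.99 are used here so that a value of exactly 20 lands in the second class, matching the "20 but less than 30" interval.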
 Relative Frequency Distribution
 Presents the relative frequency, or proportion, of the total for each group that each class represents
 Equal to the number of values in each class divided by the total number of values
 The total of the relative frequency column must always be 1.00
 Percentage Distribution
 Presents the percentage of the total for each group that each class represents
 Proportion multiplied by 100%
 The total of the percentage column must always be 100%
 Cumulative Distribution
 A way of presenting the percentage of values that are less than a specific amount
 Each row includes all the rows above it, so the rows of a cumulative distribution do not correspond to class intervals
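Starting from a set of class frequencies, the three distributions above follow by simple arithmetic. The class counts and upper boundaries below are hypothetical values used only to show the calculation.

```python
from itertools import accumulate

# Hypothetical class frequencies and the upper boundary of each class.
class_counts = [2, 1, 2, 2, 3]
upper_boundaries = [20, 30, 40, 50, 60]
total = sum(class_counts)

# Relative frequency: count in each class divided by the total count.
relative = [count / total for count in class_counts]

# Percentage: the same proportion expressed out of 100.
percentage = [100 * count / total for count in class_counts]

# Cumulative percentage: running total of counts, so each row includes all rows above it.
cumulative_pct = [100 * running / total for running in accumulate(class_counts)]

for upper, rel, pct, cum in zip(upper_boundaries, relative, percentage, cumulative_pct):
    print(f"class below {upper}: rel = {rel:.2f}, pct = {pct:.0f}%, less than {upper}: {cum:.0f}%")
```

The last cumulative row always reaches 100%, which is a quick check that no value was left out of the classes.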
Summary
 Organizing Categorical Data
 Organizing Numeric data
 Frequency Distribution
