P.A.C.E
Shaping Careers in Finance
Defining and collecting data
Agenda
DCOVA framework for decision making
Understanding variable types (Define Task)
Measurement scales for variables
Data Collection
Data Sources
Populations and Samples
Data Cleaning
Recoding Variables
Types of Sampling Methods
Commonly used probability samples
Types of Survey Errors
Ethical issues and sampling errors
DCOVA framework for decision making
Define the variables that you want to study in order to solve a problem or meet an objective
Collect the data for those variables from appropriate sources
Organize the data (Tables)
Visualize the data (Charts)
Analyze the data to reach conclusions and present those results
Understanding variable types (Define Task)
The statistical methods to use vary according to the type of variable
Categorical Variables (Qualitative Variables)
Have values that can only be placed into categories
Numerical Variables (Quantitative Variables)
Have values that represent quantities
Discrete Variables: have numerical values that arise from a counting process (integers)
Continuous Variables: produce numerical responses that arise from a measuring process (any value within a continuum, or an interval)
Some variables can be either categorical or numerical, depending on how you define them (Age)
Measurement scales for variables
Nominal and Ordinal scales
Describe the values for a categorical variable
Nominal Scale: classifies data into distinct categories in which no ranking is implied (Gender, State, etc.); the weakest form of measurement
Ordinal Scale: classifies values into distinct categories in which ranking is implied (Ratings, Grade, etc.); a stronger form of measurement than the nominal scale, but it does not account for the amount of the differences between the categories (by how much)
Interval and Ratio scales
Describe the values for a numerical variable
Interval Scale: ordered scale in which the difference between measurements is a meaningful quantity but does not involve a true zero point (Temperature in degrees Celsius or Fahrenheit)
Ratio Scale: ordered scale in which the difference between measurements is a meaningful quantity and involves a true zero point (Height)
Stronger forms of measurement than an ordinal scale
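The difference between interval and ratio scales can be checked with a little arithmetic. The sketch below is only an illustration: the temperatures, heights, and variable names are made up for this example and do not come from the slides.

```python
def celsius_to_fahrenheit(c):
    """Convert a Celsius temperature to Fahrenheit."""
    return c * 9 / 5 + 32

# Interval scale: zero is arbitrary, so ratios change with the unit.
c_low, c_high = 10.0, 20.0                      # hypothetical temperatures
f_low, f_high = celsius_to_fahrenheit(c_low), celsius_to_fahrenheit(c_high)
ratio_celsius = c_high / c_low                  # 2.0
ratio_fahrenheit = f_high / f_low               # 68 / 50 = 1.36, not 2.0

# Ratio scale: height has a true zero, so ratios survive a change of unit.
CM_PER_INCH = 2.54
h_short_cm, h_tall_cm = 90.0, 180.0             # hypothetical heights
ratio_cm = h_tall_cm / h_short_cm
ratio_inches = (h_tall_cm / CM_PER_INCH) / (h_short_cm / CM_PER_INCH)

print(ratio_celsius, ratio_fahrenheit)          # the two ratios differ
print(ratio_cm, ratio_inches)                   # the two ratios agree
```

Saying that 20 degrees is "twice as hot" as 10 degrees therefore depends on the unit, while "twice as tall" does not.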
Data Collection
Garbage in – Garbage out
Tasks in Data Collection
Identifying data sources
Deciding whether the data you collect will be from a population or a sample
Data Cleaning
Recoding the variables if required
Data Sources
Primary Data Source
You collect your own data for analysis
Secondary Data source
Data for your analysis have been collected by someone else
Modes of Data Collection
Data distributed by an organization or individual (market research companies, investment services firms, print and online media companies)
Outcomes of a designed experiment (conduct an experiment that compares features)
Survey responses (questions about respondents' beliefs, attitudes, behaviors, and other characteristics); can be affected by errors
Observational study: collect data by directly observing a behavior, usually in a natural or neutral setting
Focus groups to elicit unstructured responses to open-ended questions
Used to enhance teamwork or improve the quality of products and services
Data generated by ongoing business activities, collected from operational and transactional systems (online or offline)
Populations and Samples
Population
Consists of all the items or individuals about which you want to reach conclusions
Sample
Portion of a population selected for analysis
Results of analyzing a sample are used to estimate characteristics of the entire population
Advantages of using a sample
Less time-consuming
Less costly
Less cumbersome and more practical
Data Cleaning
There may be irregularities in the data you collect
Common Data Problems
Typographical or data entry errors
Values that are impossible or undefined
Values that are Missing
Outliers for numeric values: values that seem excessively different from most of the rest of the values (may or may not be errors)
More sophisticated statistical programs have provisions to process data that contain occasional missing values
(EXCEL does not)
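The common data problems above can be screened for with a few lines of code. This is a minimal sketch on made-up ages. The 1.5 x IQR rule used to flag outliers is one common convention and an assumption here, not something the slides prescribe.

```python
import statistics

# Hypothetical raw ages; None marks a missing value.
raw_ages = [23, 25, None, 31, -4, 29, 27, 240, 26, 30]

# Positions of missing values and of impossible values (negative ages).
missing = [i for i, v in enumerate(raw_ages) if v is None]
impossible = [i for i, v in enumerate(raw_ages) if v is not None and v < 0]

# Flag outliers among the remaining values with the 1.5 x IQR rule (assumed).
valid = sorted(v for v in raw_ages if v is not None and v >= 0)
q1, _, q3 = statistics.quantiles(valid, n=4)
iqr = q3 - q1
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [v for v in valid if v < low_fence or v > high_fence]

print("missing at positions:", missing)
print("impossible at positions:", impossible)
print("possible outliers:", outliers)
```

A flagged outlier still needs a human decision, since it may or may not be an error.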
Recoding Variables
Need to reconsider the categories that you have defined for a categorical variable
Need to transform a numerical variable into a categorical variable by assigning the individual numeric data values to one of several groups
Ensure Mutual Exclusivity
Category definitions cause each data value to be placed in one and only one category
Ensure the recoding is collectively exhaustive
The set of categories you create for the new, recoded variable includes all the data values being recoded
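As an illustration of the two recoding rules, the sketch below recodes a numeric age into groups. The group boundaries and the function name `recode_age` are hypothetical. Each age falls into exactly one group (mutually exclusive) and every non-negative age has a group (collectively exhaustive).

```python
def recode_age(age):
    """Recode a numeric age into one of four hypothetical age groups."""
    if age < 0:
        raise ValueError("age cannot be negative")
    if age < 18:
        return "Under 18"
    if age < 35:
        return "18 to 34"
    if age < 55:
        return "35 to 54"
    return "55 and over"     # open-ended last group keeps the set exhaustive

ages = [17, 18, 34, 35, 54, 55, 90]
groups = [recode_age(a) for a in ages]
print(list(zip(ages, groups)))
```

Because the boundaries are written as "less than" tests in ascending order, a boundary value such as 18 or 35 can land in only one group.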
Types of Sampling Methods
Define a frame
Complete or partial listing of the items that make up the population from which the sample will be selected
Inaccurate or biased results can occur if a frame excludes certain groups, or portions of the population
Select either a nonprobability sample or a probability sample
Nonprobability sample: select the items or individuals without knowing their probabilities of selection
Convenience, speed, and low cost
Used for small scale pilot analysis
Convenience sample: select items that are easy, inexpensive, or convenient to sample (self-selected website visitors for web surveys)
Judgment sample: collect the opinions of preselected experts in the subject matter; the results cannot be generalized to the population
Probability sample: select items based on known probabilities (allows you to make inferences about the population being analyzed)
Theory of statistical inference depends on probability sampling
Commonly used probability samples
Simple Random Sample
Every item from a frame has the same chance of selection as every other item
Every sample of a fixed size has the same chance of selection as every other sample of that size
Most elementary random sampling technique
Sampling with replacement: after you select an item, you return it to the frame, where it has the same probability of being selected again; repeat this process until you have selected the desired sample size
Sampling without replacement: once you select an item, you cannot select it again
Table of random numbers can be used for selection
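In place of a table of random numbers, a pseudo-random number generator can do the selection. The sketch below assumes a hypothetical frame of 100 numbered items. The seed is fixed only so that the example is reproducible.

```python
import random

rng = random.Random(42)          # seeded only for reproducibility
frame = list(range(1, 101))      # hypothetical frame of N = 100 item IDs
n = 10                           # desired sample size

# Without replacement: an item can appear at most once.
without_replacement = rng.sample(frame, n)

# With replacement: each draw is made from the full frame, so repeats are possible.
with_replacement = rng.choices(frame, k=n)

print(without_replacement)
print(with_replacement)
```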
Systematic sample
Partition the N items in the frame into n groups of k items, where k = N/n (round k to the nearest integer)
Choose the first item to be selected at random from the first k items in the frame
Select the remaining n – 1 items by taking every kth item thereafter from the entire frame
Prone to selection bias that can occur when there is a pattern in the frame
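The systematic sampling steps can be sketched as follows. The function name and the frame are hypothetical. This sketch rounds k down rather than to the nearest integer, an assumption made so that n items always fit inside the frame.

```python
import random

def systematic_sample(frame, n, rng):
    """Return n items: a random start among the first k, then every kth item."""
    k = len(frame) // n          # assumption: round k down so n items always fit
    start = rng.randrange(k)     # random position within the first k items
    return frame[start::k][:n]

frame = list(range(1, 101))      # hypothetical frame of N = 100 item IDs
sample = systematic_sample(frame, 10, random.Random(1))
print(sample)
```

If the frame is ordered with a repeating pattern whose period matches k, every selected item comes from the same point in the pattern, which is the selection bias noted above.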
Stratified sample
Subdivide the N items in the frame into separate subpopulations (strata) defined by some common characteristic, such as gender
Select a simple random sample within each of the strata and combine the results from the separate simple random samples
Ensures representation of items across the entire population
Homogeneity of items within each stratum provides greater precision in the estimates of underlying population parameters
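A stratified sample can be sketched as below. The frame, the stratum labels, and the function name are hypothetical. Sampling the same fraction from every stratum (proportional allocation) is an assumption of this sketch and not a requirement stated in the slides.

```python
import random
from collections import defaultdict

def stratified_sample(frame, stratum_of, fraction, rng):
    """Draw a simple random sample of the given fraction within each stratum."""
    strata = defaultdict(list)
    for item in frame:
        strata[stratum_of(item)].append(item)
    sample = []
    for members in strata.values():
        size = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, size))       # SRS without replacement
    return sample

# Hypothetical frame: 60 items in stratum "F" and 40 in stratum "M".
frame = [("F", i) for i in range(60)] + [("M", i) for i in range(40)]
sample = stratified_sample(frame, lambda item: item[0], 0.1, random.Random(7))
print(sample)
```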
Cluster Sample
Divide the N items in the frame into clusters that contain several items (naturally occurring designations, such as counties)
Take a random sample of one or more clusters and study all items in each selected cluster
More cost-effective than simple random sampling, especially if the population is spread over a wide geographic region
Often requires a larger sample size to produce results as precise as those from simple random sampling or stratified sampling
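Cluster sampling differs from stratified sampling in that whole clusters are chosen at random and every item inside a chosen cluster is studied. A minimal sketch, using made-up county names and household IDs:

```python
import random

# Hypothetical clusters: county name -> household IDs in that county.
clusters = {
    "County A": ["A1", "A2", "A3"],
    "County B": ["B1", "B2"],
    "County C": ["C1", "C2", "C3", "C4"],
    "County D": ["D1", "D2", "D3"],
}

rng = random.Random(3)                       # seeded only for reproducibility
chosen = rng.sample(sorted(clusters), 2)     # random sample of two clusters

# Study all items in each selected cluster.
sample = [item for name in chosen for item in clusters[name]]
print(chosen, sample)
```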
Types of Survey Errors
Evaluate the purpose of the survey, why it was conducted, and for whom it was conducted
Surveys that use nonprobability sampling methods are subject to serious biases that may make the results meaningless
Coverage Error (Having an adequate frame)
Occurs if certain groups of items are excluded from the frame so that they have no chance of being selected in the sample
Results in a selection bias
Nonresponse Error (Not everyone is willing to respond to a survey)
Failure to collect data on all items in the sample
Results in a nonresponse bias
Personal interviews and telephone interviews usually produce a higher response rate than mail surveys, but at a higher cost
Sampling Error
Reflects the variation from sample to sample, based on the probability of particular individuals or items being selected in the particular samples
Margin of error can be reduced by using larger sample sizes but at a higher cost
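The trade-off between sample size and margin of error can be made concrete. The slides give no formula, so the sketch below assumes the standard large-sample margin of error for a proportion, z * sqrt(p(1 - p) / n), with z = 1.96 for roughly 95% confidence and p = 0.5 as the most conservative case.

```python
import math

def margin_of_error(p, n, z=1.96):
    """Approximate margin of error for a sample proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

sizes = [100, 400, 1600]                     # hypothetical sample sizes
margins = [margin_of_error(0.5, n) for n in sizes]

for n, m in zip(sizes, margins):
    print(f"n = {n}: margin of error = {m:.4f}")
```

Under this formula the sample size has to be quadrupled to halve the margin of error, which is why extra precision becomes costly.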
Measurement Errors
When surveys rely on self-reported information, the mode of data collection, the respondent to the survey, and/or the survey itself can be possible sources of measurement error
Satisficing, social desirability, reading ability, and/or interviewer effects can depend on the mode of data collection
Social desirability bias or cognitive/memory limitations of a respondent can affect the results
Vague questions, double-barrelled questions that ask about multiple issues but require a single response, or questions that ask the respondent to report something that occurs over time but fail to clearly define the extent of time can also introduce measurement error
Need to standardize survey administration and respondent understanding of questions
Ethical issues and sampling errors
Coverage error: if particular groups or individuals are purposely excluded from the frame so that the survey results are more favorable to the survey’s sponsor
Nonresponse error: if the sponsor knowingly designs the survey so that particular groups or individuals are less likely than others to respond
Sampling error: if the findings are purposely presented without reference to sample size and margin of error so that the sponsor can promote a viewpoint that might otherwise be inappropriate
Measurement error
Survey sponsor chooses leading questions that guide the respondent in a particular direction
An interviewer, through mannerisms and tone, purposely makes a respondent feel obligated to please the interviewer or otherwise guides the respondent in a particular direction
Respondent willfully provides false information
Ethical issues also arise when the results of nonprobability samples are used to form conclusions about the entire population
Summary
DCOVA framework for decision making
Understanding variable types (Define Task)
Measurement scales for variables
Data Collection
Data Sources
Populations and Samples
Data Cleaning
Recoding Variables
Types of Sampling Methods
Commonly used probability samples
Types of Survey Errors
Ethical issues and sampling errors
Organizing Data
Agenda
Organizing Categorical Data
Organizing Numeric data
Frequency Distribution
Organizing Categorical Data
Tally the values of a variable by categories and place the results in tables
Summary Table
Tallies the values as frequencies or percentages for each category
Helps you see the differences among the categories by displaying the frequency, amount, or percentage of items in a set of categories in a separate column
Contingency Table
Cross-tabulates the values of two or more categorical variables
Helps you to study patterns that may exist between the variables
Can be shown as a frequency, a percentage of the overall total, a percentage of the row total, or a percentage of the column total
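Both kinds of table can be built by tallying. The sketch below uses made-up survey records with two hypothetical categorical variables, a channel and a yes/no answer. It builds a summary table for one variable and a contingency table, with row percentages, for the pair.

```python
from collections import Counter

# Hypothetical records: (channel, would_recommend)
responses = [
    ("Online", "Yes"), ("Online", "No"), ("Branch", "Yes"),
    ("Online", "Yes"), ("Branch", "No"), ("Branch", "No"),
    ("Online", "Yes"), ("Branch", "Yes"),
]
total = len(responses)

# Summary table: frequency and percentage for each category of one variable.
channel_counts = Counter(channel for channel, _ in responses)
summary = {c: (n, 100 * n / total) for c, n in channel_counts.items()}

# Contingency table: joint frequencies of the two variables.
contingency = Counter(responses)

# The same cells expressed as a percentage of the row (channel) total.
row_pct = {(r, c): 100 * n / channel_counts[r] for (r, c), n in contingency.items()}

print(summary)
print(dict(contingency))
print(row_pct)
```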
Organizing Numeric data
Create ordered arrays or distributions
Common Analysis
Analyze the numerical variables by groups that are defined by the values of a second, categorical variable
Stacked Format
All of the values for a numerical variable appear in one column and a second, separate column contains the categorical values that identify to which subgroup each numerical value belongs
Unstacked format
Values for a numerical variable are divided by subgroup and placed in separate columns
Ordered Array
Arranges the values of a numerical variable in rank order, from the smallest value to the largest value
Helps to get a better sense of the range of values in the data
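The stacked and unstacked layouts, and the ordered array, can be illustrated on a small made-up data set. The fund types and return values below are hypothetical.

```python
from collections import defaultdict

# Stacked format: one column of values plus one column naming the subgroup.
stacked = [
    ("Growth", 12.5), ("Value", 8.1), ("Growth", 9.7),
    ("Value", 10.4), ("Growth", 15.2),
]

# Unstacked format: one column (here, one list) of values per subgroup.
unstacked = defaultdict(list)
for group, value in stacked:
    unstacked[group].append(value)

# Ordered array: all values in rank order, smallest to largest.
ordered = sorted(value for _, value in stacked)

print(dict(unstacked))
print(ordered)
```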
Frequency Distribution
Tallies the values of a numerical variable into a set of numerically ordered classes
Class Interval: each class groups a mutually exclusive range of values
Each value can be assigned to only one class, and every value must be contained in one of the class intervals
Creating a frequency distribution
Consider how many classes would be appropriate for your data (5 to 15 is ideal)
Determine a suitable width for each class interval (subtract the lowest value from the highest value and divide that result by the number of classes)
Establish proper and clearly defined class boundaries for each class
Well-chosen class intervals lead to class midpoints that are simple to read and interpret
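The steps for creating a frequency distribution can be sketched as follows. The data values are made up. Rounding the computed width up to 10 and starting the first class at 0 are assumptions chosen to give class boundaries that are easy to read.

```python
values = [12, 47, 23, 35, 8, 41, 29, 18, 33, 26, 44, 15]  # hypothetical data
num_classes = 5

# Step 2: (highest - lowest) / number of classes = (47 - 8) / 5 = 7.8
raw_width = (max(values) - min(values)) / num_classes
width = 10   # assumption: round the raw width up to a convenient value
lower = 0    # assumption: start at a round number at or below the minimum

# Step 3: class boundaries of the form "lower but less than upper".
classes = [(lower + i * width, lower + (i + 1) * width) for i in range(num_classes)]

# Tally each value into exactly one class.
counts = [0] * num_classes
for v in values:
    counts[(v - lower) // width] += 1

for (lo, hi), count in zip(classes, counts):
    print(f"{lo} but less than {hi}: {count}")
```

Using "but less than" for the upper boundary keeps the classes mutually exclusive, so a value such as 20 belongs only to the class that starts at 20.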