Anirban Mondal
Time-efficiency
Practical considerations
When sampling, what is perhaps the most important thing that you may want to consider?
Example 1: Suppose you want to find out what percentage of a country's population supports a given political party, and you set up an Internet poll to collect this data.
Will the above approach provide you with a good (representative) sample?
That depends on Internet penetration and on which demographics use the Internet
Probability Sampling
Simple Random
Cluster
Stratified
Systematic
Systematic sampling
Start from a randomly chosen item and then keep selecting every Nth item
Example: In a relational database, suppose you select record no. 2, and assume N = 7
Now keep selecting every 7th record, i.e., records 9, 16, 23, and so on
Alternatively, you can visualize your dataset in terms of frames to do systematic sampling
The end result will be the same
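The record-selection rule above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `systematic_sample` and the 30-record dataset are assumptions made for the example, and the random start is restricted to the first interval of N records, which is one common convention.

```python
import random

def systematic_sample(records, step, start=None):
    """Return every `step`-th record, beginning at index `start`.

    `systematic_sample` is a hypothetical helper written for this example.
    If `start` is omitted, a random index within the first interval is used.
    """
    if step <= 0:
        raise ValueError("step must be positive")
    if start is None:
        # Assumption: the random start is drawn from the first `step` records.
        start = random.randrange(min(step, len(records))) if records else 0
    return records[start::step]

# Records numbered 1..30; start at record no. 2 (index 1) with N = 7.
records = list(range(1, 31))
sample = systematic_sample(records, step=7, start=1)
print(sample)  # [2, 9, 16, 23, 30]
```

Calling `systematic_sample(records, step=7)` without `start` picks the starting record at random, so repeated runs may return different samples of the same size.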
Stratified Sampling
Organize the population into multiple categories
Each category is called a stratum (plural: strata)
Example: Assume 5 different ranges of monthly income for a given population
People in each income range are part of one stratum
Now collate samples from all the strata
Applications: Marketing, elections, etc.
How many people with a monthly income in the range of $7000 - $8000 are likely to buy a given product?
How many people belonging to a certain ethnicity/race are likely to vote for a given political party?
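The income-range example can be sketched as follows. Everything concrete here is an assumption made for illustration: the helper `stratified_sample`, the synthetic incomes, the five $2000-wide ranges, and the choice to draw the same fraction from every stratum (proportional allocation), which the slides do not specify.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Draw the same fraction of items, at random, from every stratum.

    `stratified_sample` is a hypothetical helper written for this example.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    rng = random.Random(seed)

    # Organize the population into strata.
    strata = defaultdict(list)
    for item in population:
        strata[stratum_of(item)].append(item)

    # Sample within each stratum, then collate the results.
    sample = []
    for members in strata.values():
        k = max(1, round(len(members) * fraction))  # at least one per stratum
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical data: 1000 monthly incomes between $0 and $9999.
data_rng = random.Random(0)
incomes = [data_rng.randrange(10_000) for _ in range(1000)]

def income_bracket(income):
    # Five ranges of width $2000: 0 = $0-1999, ..., 4 = $8000-9999.
    return income // 2000

sample = stratified_sample(incomes, income_bracket, fraction=0.1, seed=42)
print(len(sample), sorted({income_bracket(x) for x in sample}))
```

Because sampling happens inside each stratum, every income range is guaranteed to appear in the result, which a simple random sample of the same size does not guarantee.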
Cluster sampling
Divide the population into clusters and randomly select a sample of these clusters
For the selected clusters, you can choose all the items or you can select only a sample of the items
Applications: Area sampling or geographical cluster sampling, marketing, political polls
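Both variants described above (keeping every item of the selected clusters, or only a sample of them) can be sketched as below. The helper `cluster_sample`, its parameters, and the district/household data are hypothetical names chosen for this illustration.

```python
import random

def cluster_sample(clusters, n_clusters, per_cluster=None, seed=None):
    """Randomly pick whole clusters, then take all or some of their items.

    `clusters` maps a cluster name to its list of items. With
    `per_cluster=None` every item of a chosen cluster is kept; otherwise
    at most `per_cluster` items are drawn at random from each chosen cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    result = {}
    for name in chosen:
        items = clusters[name]
        if per_cluster is None or per_cluster >= len(items):
            result[name] = list(items)
        else:
            result[name] = rng.sample(items, per_cluster)
    return result

# Hypothetical area sampling: 6 districts of 50 households each.
districts = {
    f"district_{d}": [f"d{d}_house_{h}" for h in range(50)] for d in range(6)
}

all_items = cluster_sample(districts, n_clusters=2, seed=1)
some_items = cluster_sample(districts, n_clusters=2, per_cluster=10, seed=1)
print({name: len(items) for name, items in all_items.items()})
print({name: len(items) for name, items in some_items.items()})
```

Unlike stratified sampling, only the chosen clusters contribute to the sample, so the unselected districts are not represented at all.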
Survey Sample
Suppose there is a survey for evaluating public opinion about raising the retirement age to 65
Survey questions are as follows:
What is your current age?
How would you classify your economic status (very rich, rich, upper middle, middle, lower middle, poor, living below poverty line)?
What do you think of increasing the retirement age?
Are these survey questions likely to provide good answers?
What is the estimated length of the survey? (Does it take more than 5-7 minutes to fill out?)
Questions (Cont.)
Was the data collection done in an anonymous manner? (Privacy issue)
Were the respondents overawed by the interviewer? (halo effect)
Were the questions easy to understand?
Most importantly, the ability to think about data mining from a broader perspective and to apply it in the real world