
Data Mining: An applications perspective

Anirban Mondal

Contents of the lecture


Information overload
Overview of statistical sampling
Data collection: How to conduct effective surveys?
What you have learned in this course

Effectively mining huge amounts of data


With so much data around (and it only keeps on increasing), how can we effectively perform data mining?

Statistical sampling of the data

Why is sampling important?


Cost-effectiveness

Time-efficiency

Practical considerations

In the real world, sampling sometimes becomes an inevitable compromise

When sampling, what is perhaps the most important thing that you may want to consider?

Does a sample properly represent the population?


A sample should be representative of the population, otherwise the very purpose of sampling is defeated.

Example 1: Suppose you want to find out what percentage of the population in a country supports a given political party, and you set up an Internet poll to collect this data.

Will the above approach provide you with a good (representative) sample?
Depends on Internet penetration and which demographics use the Internet

Does a sample properly represent the population?


Selection bias occurs when certain parts of the population are left out of the sample.

Selection bias could be intentional or unintentional.

Selection bias almost always produces erroneous results.
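
To make the Internet poll example concrete, here is a minimal arithmetic sketch. All three figures are hypothetical values chosen only for illustration (they are not real data), and the variable names are assumptions of this sketch. It shows how far a poll that reaches only Internet users can drift from the true level of support when the two groups differ.

```python
# Hypothetical figures chosen only for illustration; they are not real data.
internet_share = 0.40    # assumed fraction of the population that is online
support_online = 0.30    # assumed support for the party among Internet users
support_offline = 0.60   # assumed support among people who are not online

# True support is the weighted average over both groups.
true_support = (internet_share * support_online
                + (1 - internet_share) * support_offline)

# An Internet poll reaches only the online group, so this is what it estimates.
poll_estimate = support_online

print(f"true support:  {true_support:.2f}")
print(f"poll estimate: {poll_estimate:.2f}")
```

Under these assumed numbers the poll understates support because the people left out of the sample hold different views from the people included.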

Probability Sampling

Simple Random
Cluster
Stratified
Systematic

Simple Random Sampling


Example: Suppose there are 100 items numbered 1 to 100, and you need to select 10 of these items for your sample.
Simply use a random number generator to pick the items.

Each item has the same chance (probability) of being selected
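
The example above can be sketched in Python with the standard library's `random` module. The function name `simple_random_sample` and the `seed` parameter are assumptions of this sketch, added only so that a run can be repeated.

```python
import random

def simple_random_sample(items, k, seed=None):
    """Draw k distinct items, each item being equally likely to be chosen."""
    rng = random.Random(seed)      # a seed makes the draw repeatable
    return rng.sample(items, k)    # sampling without replacement

items = list(range(1, 101))        # the 100 items numbered 1 to 100
sample = simple_random_sample(items, 10, seed=42)
print(sample)
```

Sampling without replacement is used here so that no item appears twice in the sample.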

Systematic sampling
Start from a randomly chosen item, then keep selecting every Nth item.
Example: In a relational database, suppose you start at record no. 2 and N = 7.
Now keep selecting every 7th record, i.e., records 9, 16, 23, and so on.

Alternatively, you can visualize your dataset in terms of frames to do systematic sampling.
The end result will be the same.
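
A minimal sketch of the procedure above, using list slicing. The function name `systematic_sample` is hypothetical, and drawing the random start from the first N records is an assumption of this sketch (a common convention, but not something the slides specify).

```python
import random

def systematic_sample(records, step, start=None, seed=None):
    """Select every `step`-th record, beginning at 1-based position `start`."""
    if start is None:
        # Assumption: the random start is drawn from the first `step` records.
        start = random.Random(seed).randint(1, step)
    return records[start - 1::step]

records = list(range(1, 31))                         # record numbers 1 to 30
print(systematic_sample(records, step=7, start=2))   # [2, 9, 16, 23, 30]
```

With start = 2 and N = 7 this reproduces the records from the example: 2, 9, 16, 23, and then 30.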

Stratified Sampling
Organize the population into multiple categories.
Each category is called a stratum (plural: strata).
Example: Assume 5 different ranges of monthly income for a given population.
People in each income range are part of one stratum.

Take samples from each stratum.


Generally, the sample size for a stratum is decided based on the size of that stratum.

Now collate the samples from all the strata.
Applications: Marketing, elections, etc.
How many people with monthly income in the range of $7000 to $8000 are likely to buy a given product?
How many people belonging to a certain ethnicity/race are likely to vote for a given political party?
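
The steps above can be sketched as follows. This is one possible reading, assuming proportional allocation (the same sampling fraction in every stratum, so larger strata contribute more items). The population, the income ranges, and the names `stratified_sample` and `income_band` are all hypothetical and exist only for this illustration.

```python
import random
from collections import defaultdict

def stratified_sample(population, stratum_of, fraction, seed=None):
    """Sample the same fraction from every stratum and collate the results."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for person in population:                 # organize into strata
        strata[stratum_of(person)].append(person)
    sample = []
    for members in strata.values():           # sample within each stratum
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical population of (person_id, monthly_income) pairs.
gen = random.Random(0)
population = [(i, gen.randint(1000, 10999)) for i in range(1000)]

def income_band(person):
    """Map a person to one of 5 income ranges, each $2000 wide."""
    return (person[1] - 1000) // 2000

sample = stratified_sample(population, income_band, fraction=0.1, seed=1)
print(len(sample), sorted({income_band(p) for p in sample}))
```

Taking at least one item per stratum guarantees that every income range appears in the collated sample, even a very small one.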

Cluster sampling
Divide the population into clusters and randomly select a sample of these clusters.
For the selected clusters, you can choose all the items or you can select only a sample of the items.
Applications: Area sampling or geographical cluster sampling, marketing, political polls
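
A minimal sketch of both variants described above: taking every item in the selected clusters, or sampling again within them. The district and household names are made up for illustration, and `cluster_sample` and `within_fraction` are hypothetical names introduced by this sketch.

```python
import random

def cluster_sample(clusters, n_clusters, within_fraction=1.0, seed=None):
    """Randomly pick clusters, then take all or a fraction of their items."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)    # select clusters
    sample = []
    for name in chosen:
        members = clusters[name]
        if within_fraction >= 1.0:
            sample.extend(members)                        # take every item
        else:
            k = max(1, round(len(members) * within_fraction))
            sample.extend(rng.sample(members, k))         # sample within cluster
    return chosen, sample

# Hypothetical geographical clusters: 8 districts with 20 households each.
clusters = {
    f"district_{d}": [f"d{d}_household_{h}" for h in range(20)]
    for d in range(8)
}

chosen, all_items = cluster_sample(clusters, 3, seed=7)
_, some_items = cluster_sample(clusters, 3, within_fraction=0.25, seed=7)
print(chosen, len(all_items), len(some_items))
```

Unlike stratified sampling, only the selected clusters contribute items, so the unselected districts are absent from the sample entirely.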

Data collection and surveys



Before the data can be mined, the data needs to be collected.
An important method of data collection is the survey.
Applications
Marketing research
How many people are satisfied with a specific product?
What percentage of people are likely to return a product in exchange for a refund?

Public policy research


What percentage of the population is likely to support a new government policy, e.g., an increase in the retirement age to 65?

Political science research


What percentage of the population in the world would be likely to support the death penalty?

Surveys and quality of data collected


In practice, the quality of data collected via a survey depends significantly on how well the survey has been designed.
Now let us see a sample survey.

Survey Sample
Suppose there is a survey for evaluating public opinion about raising the retirement age to 65.
Survey questions are as follows:
What is your current age?
How would you classify your economic status (very rich, rich, upper middle, middle, lower middle, poor, living below poverty line)?
What do you think of increasing the retirement age?
Are these survey questions likely to provide good answers?

Survey Sample (Cont.)


Suppose there is a survey for evaluating public opinion about raising the retirement age to 65.
Survey questions are as follows:
What is your current age? (Privacy issue: people may not want to state their age.)
How would you classify your economic status (very rich, rich, upper middle, middle, lower middle, poor, living below poverty line)? (Privacy issue AND the question is too subjective.)
What do you think of increasing the retirement age? (The question is too open-ended, so it will be difficult to meaningfully interpret the final results of this survey question.)

Survey Sample (Cont.)


Suppose there is a survey for evaluating public opinion about raising the retirement age to 65.
Survey questions are improved as follows:
Instead of "What is your current age?", ask "Which is your age group?" (Less of a privacy issue, better chances of getting an answer.)
Instead of "How would you classify your economic status (very rich, rich, upper middle, middle, lower middle, poor, living below poverty line)?", ask "Within what range does your net worth fall?" (Less of a privacy issue and less subjective.)
Instead of "What do you think of increasing the retirement age?", ask "Would you prefer that the retirement age be increased?" (This is not an open-ended question, hence it will be easier to meaningfully interpret the final results of this survey question.)

Questions to evaluate survey quality

What sample of the population was included in the survey?

Did the sample adequately represent the population?

What is the estimated length of the survey? (Does it take more than 5-7 minutes to complete?)

Questions (Cont.)
Was the data collection done in an anonymous manner? (Privacy issue)

Was any peer pressure involved?

Questions (Cont.)
Were the respondents overawed by the interviewer? (halo effect)

Were the questions easy to understand?

Were there any leading questions?

How many of the respondents did not answer?

And many other questions!

Ethical issues in a Survey


Which organization did this survey?

Is there any conflict of interest?

Do they have any hidden agenda?

What you have learned in this course


Data mining techniques
With specific focus on association rule and clustering approaches
A good number of the most important data mining techniques have been covered in this course to provide you with a suite of data mining techniques

Understanding the applications of data mining in the real world


And the ability to determine trade-offs between data mining techniques for deciding which technique to use for a given application
And the ability to modify data mining techniques to cater to the needs of various kinds of applications

What you have learned in this course (Cont.)


Exposure to important research papers in the data mining field
Project: Experience in implementing data mining techniques to solve a real-world problem
Practical issues concerning how to specify inputs to data mining algorithms

Most importantly, the ability to think about data mining from a broader perspective and to apply data mining in the real world
