
IDS 472 Assignment 2 & 3 Background

Direct Mail Fundraising

A national veterans' organization wishes to develop a data mining model to improve the cost-effectiveness of their direct marketing campaign. The organization, with its in-house database of over 13 million donors, is one of the largest direct mail fundraisers in the United States. According to their recent mailing records, the overall response rate is 5.1%. Out of those who responded (donated), the average donation is $13.00. Each mailing, which includes a gift of personalized address labels and assortments of cards and envelopes, costs $0.68 to produce and send. Using these facts, we take a sample of this dataset to develop a classification model that can effectively capture donors so that the expected net profit is maximized. Weighted sampling is used, under-representing the non-responders so that the sample has equal numbers of donors and non-donors.

Data

The file Fundraising.xls contains 3120 data points. The sample has been balanced to carry equal proportions of donors and non-donors, i.e., the data has 50% donors (TARGETB = 1) and 50% non-donors (TARGETB = 0). The amount of donation (TARGETD) is also included but is not used in this case. The descriptions for the 25 variables (including two target variables) are listed in the Table below.
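The mailing economics quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function and constant names are our own, and the inputs are just the three figures stated in the background (5.1% response rate, $13.00 average donation, $0.68 per mailing).

```python
# Figures quoted in the background section.
RESPONSE_RATE = 0.051   # overall response rate
AVG_DONATION = 13.00    # average donation among responders, in dollars
MAILING_COST = 0.68     # cost to produce and send one mailing, in dollars


def expected_profit_per_mailing(response_rate, avg_donation, mailing_cost):
    """Expected net profit from mailing one person chosen at random."""
    return response_rate * avg_donation - mailing_cost


def break_even_response_rate(avg_donation, mailing_cost):
    """Response rate at which the expected donation just covers the mailing cost."""
    return mailing_cost / avg_donation


baseline = expected_profit_per_mailing(RESPONSE_RATE, AVG_DONATION, MAILING_COST)
break_even = break_even_response_rate(AVG_DONATION, MAILING_COST)

print(f"Expected profit per untargeted mailing: ${baseline:.3f}")
print(f"Break-even response rate: {break_even:.2%}")
```

Under these figures an untargeted mailing loses a little under two cents per piece, and the break-even response rate sits just above the observed 5.1%. This is why the case asks for a model that concentrates mailings on likely donors.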

Assignment

In this assignment, we examine the performance of different modeling techniques: logistic regression, decision trees, and k-nearest neighbor classifiers. Performance should be evaluated based on the costs and benefits given above; we want a model that maximizes profit. We will break this work up into two phases, each of which will count as an assignment.

Phase 1 (Assignment 2A): [due next Sunday, Feb 19th]
- Examination of the data, data transformations, PCA, etc.
- Building logistic regression and decision tree models.
- Assess model performance. Do different variables/transformations influence performance?

Phase 2 (Assignment 2B): [due Sunday, Feb 26th]
- Include profit calculations in model building.
- Develop models using all 3 techniques, optimize variable selection, select the best model.
- Apply the best model to unseen data (separate file, without values of dependent variable).
- Specify how this model should be applied to select donors. Performance on unseen data will be determined during assignment evaluation (and team(s) with maximum profit model will get extra credit points).

Phase 1

Step 1: Import the data, and examine the different variables: distribution of values, mean and standard deviation, range of values. What do you observe? What variable transformations do you make (and why)?

Step 2: Model Building

(a) Partitioning: Partition the dataset into 60% training and 40% validation (set the seed to 12345). [A specified seed ensures that we obtain the same random partitioning every time we run it. With no specified seed, the system clock is typically used to set the seed, and a different partitioning can result in different runs.]

(b) Selecting classification tool and parameters. Run the following classification tools on the data:
- Logistic Regression
- Decision Trees (try multiple tree types; do you find performance differences? which will you use?)

Be sure to test different parameter values for each method. You should also try to run each method on a subset of the variables (how do you select this subset?). Be sure NOT to include TARGETD in your analysis.
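The seeded 60/40 partition described in Step 2(a) might look like the following sketch. It uses only Python's standard library and works on row indices. The function name `partition_indices` is our own. A given seed reproduces the same split only within the same tool: the partition this code produces for seed 12345 will not match what a spreadsheet add-in or another language produces for that seed.

```python
import random


def partition_indices(n_rows, train_fraction=0.6, seed=12345):
    """Split row indices 0..n_rows-1 into training and validation sets.

    A dedicated Random instance seeded with `seed` makes the shuffle,
    and therefore the partition, identical on every run.
    """
    rng = random.Random(seed)
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_train = round(n_rows * train_fraction)
    return sorted(indices[:n_train]), sorted(indices[n_train:])


# The sample file is described as having 3120 rows.
train_idx, valid_idx = partition_indices(3120)
print(len(train_idx), len(valid_idx))
```

Because the sample is balanced, a plain random split should leave both partitions close to 50% donors. Checking the TARGETB proportions in each partition is a cheap sanity test.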

Phase 2

1. Classification under asymmetric response and cost: What is the reasoning behind using weighted sampling to produce a training set with equal numbers of donors and non-donors? Why not use a simple random sample from the original dataset? (Hint: given the actual response rate of 5.1%, how do you think the classification models will behave under simple sampling?) In this case, is classification accuracy a good performance metric for our purposes of maximizing net profit? If not, how would you determine the best model? Explain your reasoning.

2. Develop a k-nearest neighbor model. Explain what you do, parameters, etc. (How do you obtain the best k?) Compare performance across all 3 techniques (you may use different models than those submitted in Phase 1).

Calculate Net Profit: For each method (choose the best model for each method/technique, either with the full or reduced set of variables), calculate the lift of net profit for both the training and validation set based on the actual response rate (5.1%). Again, the expected donation, given that they are donors, is $13.00, and the total cost of each mailing is $0.68. (Hint: to calculate estimated net profit, we will need to undo the effects of the weighted sampling, and calculate the net profit that would reflect the actual response distribution of 5.1% donors and 94.9% non-donors.)

3. Draw Lift Curves: Draw each model's net profit lift curve for the validation set onto a single graph. Are there any models that dominate?
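One way to undo the weighted sampling when computing net profit is to reweight each record by the ratio of its class's actual share to its share in the balanced sample. The sketch below assumes that approach, which is one common convention and not necessarily the one the instructor intends. The function names and toy scores are hypothetical. Donors get weight 0.051 / 0.5 and non-donors 0.949 / 0.5, so the weights average to 1 over a balanced sample.

```python
ACTUAL_DONOR_RATE = 0.051   # response rate in the full donor database
SAMPLE_DONOR_RATE = 0.50    # donor share in the balanced sample
AVG_DONATION = 13.00
MAILING_COST = 0.68

# Assumed reweighting: actual class share divided by sample class share.
DONOR_WEIGHT = ACTUAL_DONOR_RATE / SAMPLE_DONOR_RATE                  # 0.102
NONDONOR_WEIGHT = (1 - ACTUAL_DONOR_RATE) / (1 - SAMPLE_DONOR_RATE)   # 1.898


def weighted_profit(is_donor):
    """Net profit of mailing one sampled record, rescaled to the real mix."""
    if is_donor:
        return DONOR_WEIGHT * (AVG_DONATION - MAILING_COST)
    return NONDONOR_WEIGHT * (-MAILING_COST)


def cumulative_net_profit(probabilities, actuals):
    """Rank records by predicted probability (highest first) and return
    (probability, cumulative weighted net profit) for each rank."""
    ranked = sorted(zip(probabilities, actuals), key=lambda p: p[0], reverse=True)
    curve, running = [], 0.0
    for prob, actual in ranked:
        running += weighted_profit(actual == 1)
        curve.append((prob, running))
    return curve


# Hypothetical validation scores and true TARGETB values.
probs = [0.9, 0.8, 0.7, 0.4, 0.2]
targets = [1, 1, 0, 1, 0]
curve = cumulative_net_profit(probs, targets)
peak_prob, peak_profit = max(curve, key=lambda point: point[1])
print(f"Peak cumulative net profit {peak_profit:.5f} at probability {peak_prob}")
```

Plotting the second element of each pair against rank gives a net profit lift curve. The probability at the peak is a natural candidate for the mailing cutoff.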

5. Best Model: From your answers above, what do you think is the best model? (What criteria do you use to determine "best"?) Summarize the performance of the best model from each method, in terms of net profit from predicting donors in the validation dataset; at what cutoff is the best performance obtained?

Step 3: Testing. The file FutureFundraising.xls contains the attributes for future mailing candidates. Using your best model from Step 2 (#5), which of these candidates do you predict as donors and non-donors? List them in descending order of probability of being a donor. What cutoff do you use to predict donor/non-donor? Submit this file (xls format), with your best model's predictions (probability of being a donor). [The FutureFundraising.xls data file does not contain values of the target variable, so you cannot see how your model performs on this data. In evaluating your assignment, the instructor will determine how your best model performs on this data. Note that part of each team's evaluation will be based on how well your model performs relative to models submitted by other teams.]
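The ranking and labeling asked for in Step 3 can be sketched as follows. The candidate IDs, the scores, and the 0.5 cutoff are placeholders. In practice the cutoff would come from the net profit analysis on the validation set, and the function name `rank_candidates` is our own.

```python
def rank_candidates(scored, cutoff):
    """Sort (candidate_id, probability) pairs by descending probability and
    label each candidate 1 (predicted donor) or 0 (predicted non-donor)."""
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(cid, prob, 1 if prob >= cutoff else 0) for cid, prob in ranked]


# Hypothetical model scores for four future mailing candidates.
example_scores = [("A", 0.31), ("B", 0.77), ("C", 0.52), ("D", 0.49)]
ranked = rank_candidates(example_scores, cutoff=0.5)

for candidate_id, probability, predicted_donor in ranked:
    print(candidate_id, probability, predicted_donor)
```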

Turn in: One xls file, with all results. Please label your tabs. The first tab should describe the contents of the different tabs. The predictions for the FutureFundraising data should be on a separate tab.

Table 13.7: Description of Variables for the Fundraising Dataset

Variable      Description
ZIP           Zipcode group (zipcodes were grouped into 5 groups; only 4 are needed for analysis since if a potential donor falls into none of the four he or she must be in the other group. Inclusion of all five variables would be redundant and cause some modeling techniques to fail. A 1 indicates the potential donor belongs to this zip group.)
                00000-19999   1 (omitted for above reason)
                20000-39999   zipconvert2
                40000-59999   zipconvert3
                60000-79999   zipconvert4
                80000-99999   zipconvert5
HOMEOWNER     1 = homeowner, 0 = not a homeowner
NUMCHLD       Number of children
INCOME        Household income
GENDER        Gender, 0 = Male, 1 = Female
WEALTH        Wealth rating. Wealth rating uses median family income and population statistics from each area to index relative wealth within each state. The segments are denoted 0-9, with 9 being the highest wealth group and zero being the lowest. Each rating has a different meaning within each state.
HV            Average home value in potential donor's neighborhood, in $ hundreds
ICmed         Median family income in potential donor's neighborhood, in $ hundreds
ICavg         Average family income in potential donor's neighborhood, in $ hundreds
IC15          Percent earning less than 15K in potential donor's neighborhood
NUMPROM       Lifetime number of promotions received to date
RAMNTALL      Dollar amount of lifetime gifts to date
MAXRAMNT      Dollar amount of largest gift to date
LASTGIFT      Dollar amount of most recent gift
TOTALMONTHS   Number of months from last donation to July 1998 (the last time the case was updated)
TIMELAG       Number of months between first and second gift
AVGGIFT       Average dollar amount of gifts to date
TARGETB       Target variable: binary indicator for response, 1 = Donor, 0 = Non-donor
TARGETD       Target variable: donation amount (in $). We will NOT be using this variable for this case.
