
ST 635 Statistics Project

Team members: Chih Ying Lee, Praveena Mani, Sandipan Sen

Table of Contents
INTRODUCTION
  Objective
  Dataset description
  Data validation
ANALYSIS
  Description of variables
    Age
    Job
    Marital
    Education
    Default
    Balance
    Housing
    Loan
    Contact
    Day
    Month
    Duration
    Campaign
    Pdays
    Previous
    Poutcome
HYPOTHESIS
METHODOLOGY
  Identification of important variables
  Decision rules
    Customers least likely to subscribe
    Customers most likely to subscribe
  Underlying patterns among variables
    Age and job
    Age and marital status
    Job and education
    Job and housing
    Housing and month
    Contact and month
    Pdays and poutcome
  Selection of relevant variables
  Subscription Model (with poutcome)
  Non-significant factors
  Testing our Hypothesis
  Subscription Model (with pdays)
CONCLUSION
  Recommended Marketing Strategy
EXHIBIT
REFERENCE

INTRODUCTION
Objective
We want to predict the chance that a customer will subscribe to a Certificate of Deposit (CD). A Portuguese banking institution conducted a mass marketing campaign to sell CD subscriptions between 2008 and 2010. Our goal is to analyze the socio-economic profile of the customers contacted as part of this campaign and derive a statistical model with which we can predict the outcome of any similar marketing campaign run by the bank in the future. The idea is to develop a targeted marketing strategy on the basis of patterns observed in the historical data.

Dataset description
We have identified a dataset related to a direct marketing campaign of a Portuguese banking institution. The dataset contains detailed information on potential customers who were contacted as part of the campaign. Data was randomly collected during the period May 2008 to November 2010 by making phone calls to the clients. Often more than one contact with the same client was required to assess whether the client would subscribe to the CD. Our dataset consists of 17 attributes, a combination of 7 numeric and 10 categorical variables, and has a sample size of 45,211 records. Various parameters such as age, job, marital status, credit default and education have been taken into consideration.

Data validation
During our data validation we came across a few variables that had unknown values but weren't classified as missing by our data source. Let's consider the variable poutcome, which signifies the outcome of a previous marketing campaign. Values of success or failure are self-explanatory. A value of other means that a previously contacted customer couldn't decide whether to subscribe to a CD: the customer was probably unsure during the previous campaign but didn't necessarily rule out subscribing at a later point in time. However, for cases where poutcome is unknown, none of the above scenarios applies. Further investigation revealed a strong association between poutcome and the variable pdays. When poutcome is unknown, all but 5 records have a pdays value of -1. Pdays indicates the number of days that passed after the client was last contacted in a previous campaign, and a value of -1 implies that the client wasn't previously contacted. This strong association leads us to believe that a poutcome of unknown simply means that the client was not contacted during any previous marketing campaign and is therefore not a missing value. The 5 records for which poutcome is unknown but pdays is not -1 were considered erroneous entries, and we excluded them from our analysis.

An unknown value in the job variable indicates that the occupation of the individual doesn't fall under any of the other 11 categories profiled by the bank.

The variable contact, which denotes the mode of communication the bank used in contacting the customer, has unknown as one of its possible values. It means that people didn't share their contact information. These people were contacted through other means such as mail offers, e-mail or a personal visit by a bank sales representative.
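The consistency check described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the field names pdays and poutcome follow the dataset, but the sample records and the helper name split_inconsistent are hypothetical and are not part of our actual workflow.

```python
def split_inconsistent(records):
    """Separate records whose poutcome/pdays combination is contradictory.

    A record is treated as inconsistent when poutcome is "unknown"
    (client never contacted before) but pdays is not -1.
    """
    kept, dropped = [], []
    for rec in records:
        if rec["poutcome"] == "unknown" and rec["pdays"] != -1:
            dropped.append(rec)
        else:
            kept.append(rec)
    return kept, dropped


# Hypothetical sample rows, not taken from the real dataset.
sample = [
    {"pdays": -1, "poutcome": "unknown"},
    {"pdays": 92, "poutcome": "failure"},
    {"pdays": 168, "poutcome": "unknown"},  # contradictory entry
]
kept, dropped = split_inconsistent(sample)
```

Applied to the full dataset, a filter of this kind would be expected to set aside the 5 contradictory records and keep the rest.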

ANALYSIS
Description of variables
Our first descriptive analysis was conducted on our dependent variable. The objective of our research is to define the strategy for a targeted marketing campaign in the future. In order to do so we first needed to understand how the current marketing campaign performed. A descriptive analysis of the subscription variable shows that out of 45,206 customers contacted only 5,287 subscribed to the bank's CD, a success rate of 11.7%. This is quite a low performance considering the amount of time and money spent contacting these customers not only once but repeatedly. Were repeated phone calls a good idea? Was the bank able to target the right set of customers based on their socio-economic behavior? How many resources were wasted on customers who had little potential to subscribe? To answer such questions we had to formulate various hypotheses, test them and finally collate the results to identify the right customer profile.
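The headline figure can be reproduced directly from the counts quoted in this report. The snippet below is a small arithmetic check only; the variable names are our own.

```python
total_records = 45_211      # records in the original dataset
erroneous = 5               # records dropped during data validation
contacted = total_records - erroneous
subscribed = 5_287          # customers who subscribed to the CD

success_rate = subscribed / contacted
print(f"Contacted: {contacted}, success rate: {success_rate:.1%}")
# prints "Contacted: 45206, success rate: 11.7%"
```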

Age
The distribution of age is not normal. However, since our sample size is quite large, the Central Limit Theorem lets us state with 99% confidence that the average age of targeted customers was around 40 years (Exhibit 3-b). Repeating the descriptive analysis with only the successful subscription cases didn't change the results much. So it looks like the bank typically kept targeting people around 40 years old, and hence the majority of the subscriptions came from this target group. However, when we categorized customers into logical age groups we found that people around the age of 40 are among the least likely to subscribe to a CD. People between 18 and 27 years of age, which includes undergraduate students, young professionals and people pursuing their master's degrees, have a good chance of subscribing, and beyond 60 years of age people's tendency to subscribe to a CD also increases (Exhibit 1-a). This seems logical because people tend to retire after 60, and a CD can then become an important source of income for their family.
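As a rough sketch of how a large-sample confidence interval for the mean age can be obtained, the snippet below uses the normal approximation justified by the Central Limit Theorem. The ages listed are hypothetical placeholders, not our data, and the function name is our own.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.99):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    centre = mean(values)
    std_error = stdev(values) / sqrt(n)
    # Two-sided critical value, about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return centre - z * std_error, centre + z * std_error


# Hypothetical ages for illustration only; the real column has 45,206 values.
ages = [25, 31, 38, 40, 42, 47, 55, 62] * 50
low, high = mean_confidence_interval(ages)
```

With tens of thousands of observations the standard error becomes very small, which is why the interval around the sample mean is narrow even though the underlying distribution is skewed.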

Job
Analysis of successful subscriptions by job category revealed that students have the highest chance of subscription, followed by retired and unemployed people. People working in management and administrative positions are also quite likely to subscribe. A cross-sectional analysis of job category vs. age group suggests that management and administrative positions are mostly filled by people between 28 and 37 years of age (Exhibit 1-b). This is usually the time of one's life when people form families, have children and look for additional sources of income. Therefore management and administrative workers can form a good target group. Following them is the group of people who are self-employed or have started their own business. Such individuals are always on the lookout for extra sources of cash, probably because of the volatility of their business, a need for extra funding in the future or an incentive to save on taxes.

Marital
A person was listed as single, married or divorced. Among these groups, married people were targeted most heavily, followed by singles. From an analysis of the success rates achieved in each category we found that singles were most likely to subscribe to a CD, followed by people who were divorced. Interestingly, married people, who were the main target customers for the bank, ranked lowest (Exhibit 1-c).

Education
In Portugal, the education system is divided into three categories: primary, secondary and tertiary. Primary education is free and compulsory for 9 years. Beyond that starts secondary education, which consists of three years (10th, 11th and 12th). Higher education after the 12th year is classified as tertiary and includes undergraduate, master's and doctoral programs. Our dataset contains another category called unknown for the highest level of education received by a customer. Our research indicates that such cases occur when the customer decides not to disclose this information. From our analysis we found that people with tertiary education had the highest subscription rate compared to the others. A general pattern that can be inferred from the graph is that the subscription rate increases with the level of education (Exhibit 1-d).

Default
The default variable measures whether a customer has defaulted on his/her credit payments. It gives an overall indication of how well the customer manages his/her credit. About 11.79% of the customers who haven't defaulted subscribed to a CD, compared to 6.38% of the customers who have. Also, very few people with defaulted credit were contacted for the campaign: about 815, as opposed to 44,391 people who didn't default (Exhibit 1-e).
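One way to check whether a gap like 11.79% versus 6.38% could be due to chance is a pooled two-proportion z-test. The sketch below is illustrative and was not part of our original analysis. The group sizes come from the text, but the subscriber counts are not reported there, so they are reconstructed from the quoted percentages and should be treated as approximations.

```python
from math import erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Pooled two-proportion z statistic and two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / std_error
    p_value = erfc(abs(z) / sqrt(2))  # two-sided normal tail probability
    return z, p_value


n_no_default, n_default = 44_391, 815
# Subscriber counts are NOT in the report; they are rounded back from the
# quoted rates (11.79% and 6.38%) and are therefore approximations.
x_no_default = round(0.1179 * n_no_default)
x_default = round(0.0638 * n_default)

z, p_value = two_proportion_z(x_no_default, n_no_default, x_default, n_default)
print(f"z = {z:.2f}, p-value = {p_value:.2g}")
```

A large z statistic here would suggest that the difference is unlikely to be a sampling accident, even though the defaulted group is small.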

Balance
The distribution of the average yearly balance is not normal. It is highly right-skewed, similar to what we generally observe for the distribution of income. The bank's main targets were people with a lower yearly balance in their account. However, there doesn't appear to be any significant difference in subscription rates across the other yearly balance categories. We observed two cases in the 80,000 to 90,000 Euro range, which had a 100% subscription rate; however, we don't have enough data to rule out that this happened merely by chance (Exhibit 1-f).

Housing
Almost 56% of the people who were approached during the campaign had a housing loan. However, only 7.7% of them subscribed. By contrast, 16.7% of the people who didn't have a housing loan subscribed. It seems it is easier for the bank to convince customers who do not have a housing loan (Exhibit 1-g).
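As a quick sanity check, the two group rates should average out, weighted by group size, to roughly the overall success rate of 11.7%. The arithmetic below uses only the rounded figures quoted in this report, so a small discrepancy is expected.

```python
share_with_loan = 0.56      # "almost 56%" of contacted customers
rate_with_loan = 0.077      # subscription rate with a housing loan
rate_without_loan = 0.167   # subscription rate without a housing loan

# Law of total probability: the overall rate is the share-weighted average.
overall = (share_with_loan * rate_with_loan
           + (1 - share_with_loan) * rate_without_loan)
print(f"Implied overall rate: {overall:.1%}")
```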

Loan
As with housing loans, customers who do not have personal loans or other debts to pay off are more likely to subscribe to a CD (Exhibit 1-h).

Contact
The majority of the customers were contacted by cellphone. The second most common way of reaching them was through mail offers, newsletters or a personal visit from a bank sales representative. The least used method was to reach them on their landline. Our analysis shows that people responded positively more often when contacted via cellphone than when contacted via landline, and were least responsive to the other modes of communication. Cellphones and landlines offer the flexibility of negotiating the terms and conditions of a deal, whereas mail offers may be too generic. On the other hand, a sales representative's visit may look too aggressive. This leads us to believe that the more reachable and interactive the communication with the customer is, the more likely he/she is to subscribe to a CD (Exhibit 1-i).

Day
We found that the subscription rate remains fairly constant regardless of the day of the month on which customers were contacted. Day of the month is therefore probably not a good predictor in our analysis (Exhibit 1-j).

Month
Subscription is highest during the months of March, September, October and December. These months are usually the festive seasons in Portugal. The country celebrates Rio-style carnivals during the month of March and the year-end is filled with events such as their Independence Day, Christmas, etc. High subscription rate during the festive season could be because of banks offering attractive interest rates or flexible deposit plans during the period (Exhibit 1-k).

Duration
Duration measures the time spent on the call during the last contact with the customer. The data reveals that people who spent less than 10 minutes on the call were less likely to subscribe to a CD. On the other hand, calls that lasted longer than 10 minutes show good subscription rates. In a typical call, a bank representative may take approximately 5-7 minutes to explain the initial set of terms and conditions of a plan to the customer. The rest of the time is mostly spent answering the customer's questions. Someone who is not interested in the proposal is unlikely to prolong the call, whereas longer call durations suggest that customers are willing to hear the details of a subscription plan and probably have an interest. Most successes came from calls lasting between 20 and 50 minutes (Exhibit 1-l).


Campaign
The campaign variable gives us the count of phone calls made to the customer during the current campaign, which lasted from May 2008 to November 2010. Our analysis shows that people who were contacted 1-5 times during the campaign had a subscription rate of 12.32%, slightly higher than the overall success rate. Most of the customers fell into this range of 1-5 phone calls. However, when people were contacted more than 5 times, the subscription rate fell drastically. Generally, people who are interested in fixed deposits will readily subscribe to a CD without being urged. For those who are not, repeated calls aren't going to change their minds all of a sudden; instead they might cause frustration and reduce the chances of subscription further (Exhibit 1-m).

Pdays
Pdays measures the number of days that passed after the client was last contacted in a previous campaign. We found that new customers were the main target for the bank, comprising almost 82% of all the people who were contacted. It turns out that the subscription rate among new customers was considerably low (below the overall success rate) compared with that among previously contacted customers. Among previously contacted customers, when there was a gap of more than a year since the last contact, the subscription rate was quite high. This may reflect a good relationship between the customer and the bank over the past year, or returning customers who were happy with previous subscriptions. It should be noted that customers who were contacted within 3 months also showed a good subscription rate of 43.25%, which possibly represents an ongoing campaign with customers in the formal process of subscription (Exhibit 1-n).


Previous
The marketing campaign primarily focused on customers who had been contacted fewer than 10 times in previous campaigns. For example, 44,845 customers had been contacted 0-9 times, with a success rate of 11.61%. The subscription rate was considerably higher (23.83%) for customers who had been contacted 10-19 times before this campaign. Beyond that, the subscription rate fell again. Generally, an optimum number of interactions is necessary to maintain a good reputation with customers. Customers can perceive too few calls as a sign of disinterest from the bank, while too many calls may be seen as an oversell. The bank should carefully consider its customer retention strategy. Interestingly, we found a very high subscription rate of 66.67% among customers who had been contacted 50-60 times. However, this could well be a chance result, because only 3 customers were contacted in that range, of whom 2 subscribed (Exhibit 1-o).
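To illustrate why a rate based on 3 customers carries little information, the sketch below computes a Wilson score interval for 2 successes out of 3. The Wilson interval is our choice for this illustration and was not part of the original analysis. The function name is our own.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(2, 3)  # 2 subscribers out of 3 customers
print(f"95% interval: {low:.1%} to {high:.1%}")
```

The resulting interval spans most of the range between 0 and 1, so the 66.67% figure is compatible with both a low and a very high underlying subscription rate.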

Poutcome
Analysis of subscription data based on the results of the previous marketing campaign clearly shows that if the previous campaign was successful for a particular customer, then there is a higher chance that the customer will subscribe in the current campaign. This can be attributed to several factors such as trust developed with the bank or higher satisfaction. Therefore the bank could focus its marketing efforts on customers who opened an account during previous campaigns (Exhibit 1-p).


HYPOTHESIS
Our discussion of the variables has raised a few interesting questions which we would like to answer. In order to do so we have formulated the following hypotheses, which we will test going forward.

H1: Students or young professionals (18-27) and people on the verge of retiring (around 60) have a higher chance of subscription.
H2: Customers who are single are more likely to subscribe than customers who are married.
H3: The chance of subscription increases with a higher degree of education.
H4: People who have a good credit history are good targets.
H5: People with less financial liability, such as a personal loan, are more likely to subscribe.
H6: People who are more reachable are more likely to subscribe.
H7: The chance of subscription increases during the months of the festive season.
H8: The more time spent on the call, the more likely customers are to subscribe. However, repeated calls reduce that chance.
H9: Returning customers have a higher chance of subscription.
H10: Subscription chances are higher if customers are contacted a year after a previous campaign.


METHODOLOGY
Identification of important variables
As a first step in evaluating our hypotheses we wanted to understand the importance of each variable in the marketing campaign. For that we conducted a decision tree analysis involving all the variables. The dataset was partitioned into two groups, with two thirds used as training data and one third as validation data. The chi-square probability (p-value) statistic was used as our splitting criterion. Also, because our dataset is large, we required at least 100 customers in a group for it to count as significant; smaller groups were not considered a meaningful categorization.
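To make the splitting criterion more concrete, the sketch below computes the Pearson chi-square statistic and its p-value for a candidate binary split, where each branch is cross-tabulated against the subscribed / not subscribed outcome. A split with a smaller p-value separates the two outcomes more sharply. This is a generic illustration, not the exact procedure of the software we used, and the counts are hypothetical.

```python
from math import erfc, sqrt


def chi_square_split(left_yes, left_no, right_yes, right_no):
    """Pearson chi-square statistic and p-value (1 d.f.) for a binary split."""
    table = [[left_yes, left_no], [right_yes, right_no]]
    total = left_yes + left_no + right_yes + right_no
    row_sums = [sum(row) for row in table]
    col_sums = [left_yes + right_yes, left_no + right_no]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts for two candidate splits of the same 1,000 customers.
strong = chi_square_split(left_yes=40, left_no=460, right_yes=160, right_no=340)
weak = chi_square_split(left_yes=95, left_no=405, right_yes=105, right_no=395)
```

In this made-up example the first split would be preferred, since its branches differ much more in subscription rate and its p-value is far smaller.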

Analysis reveals that the misclassification rate is quite low. Only about 9% of the observations were categorized erroneously, and our validation dataset follows this statistic very closely (Exhibit 2-a). This suggests that our decision tree model is reliable. The variable importance chart reflected some interesting findings. A good number of variables which we thought had considerable relevance to the bank marketing campaign were deemed unimportant. For example, we believed that having a housing or personal loan was a negative indicator of subscription. We thought that contacting customers repeatedly would reduce their chance of subscription. Educational qualification was also thought to be an important consideration, and we felt that a person's credit history had substantial significance for subscription chances. The variables that contributed to the model, in decreasing order of importance, are duration, poutcome, month, age, marital status and contact (Exhibit 2-b). On careful investigation of the leaf nodes we found that the decision tree focused more on classifying the types of customers who are less likely to subscribe than on identifying the cases that are likely to be a success. This works like a process of elimination, in which instead of trying to find an ideal match we get rid of cases that are unlikely to subscribe. So even though we don't know exactly who the right set of customers is, we know with reasonable certainty who we shouldn't target. This makes sense because there could be several other factors that weren't captured in the campaign data and that can increase the probability of success.

Decision rules
The decision tree in Fig () displays the results of the marketing campaign to sell CD subscriptions. Customers who subscribed at the end of the campaign were coded 1. The root node shows that, of the 30,286 customers in the training dataset who were targeted, 11.7% subscribed to a CD whereas 88.3% did not (coded with 0).

This decision tree could be used by the bank at several different points when deciding which groups of customers its marketing campaign should focus on (Exhibit 2-c).

When the duration was less than 8.68 minutes VERSUS when the duration was greater than or equal to 8.68 minutes

Under the root node, the first split was made on duration. Duration was the most important factor in predicting subscription, and it is applied several times in the decision tree. It is a general perception that when a person is interested in a bank product or any other service, more time is spent on the call with them. Comparing the two branches, people who spent 8.68 minutes or more on the call had a subscription rate of 44.1%, whereas people who spent less than 8.68 minutes had a very low subscription rate of just 7.7%. This confirms the perception we had earlier. The bank could make use of this rule and focus on improving customer service by making attractive offers or training its representatives to keep customers engaged on the call for longer.

When the duration was less than 8.68 minutes and poutcome was Successful VERSUS when poutcome was Failure or Unknown

When the time spent on the call was less than 8.68 minutes, customers were further categorized based on the outcome of the previous marketing campaign. One could expect that customers who spend less time on the call are not interested in the bank's service. However, the decision tree clearly shows that customers who subscribed to a CD in the previous campaign (coded success) had a high subscription rate of 62.4%, whereas customers who did not subscribe previously or were never contacted before (coded failure, unknown) had a very low subscription rate of 5.9%. According to this decision rule, even if less time was spent on the call, the bank should filter for customers who had subscribed previously and direct its marketing efforts towards them.

When poutcome was successful and duration was less than 2.21 minutes VERSUS when duration was between 2.21 minutes and 8.68 minutes

When the previous outcome was successful, the customers were categorized once again based on duration. From the decision tree it is clear that people with a successful previous campaign who spent between 2.21 and 8.68 minutes on the call have a good subscription rate of 71.7%, versus a low subscription rate of 21.5% for people who spent less than 2.21 minutes. Therefore, even though customers with a successful previous campaign are easier to target the second time, the bank representatives should try to keep them engaged for at least 2.21 minutes to increase the chances of subscription.

When poutcome was failure, unknown or other, duration was less than 8.68 minutes and month was October, March, September VERSUS when month was January, February, April, May, June, July, August, November and December

When poutcome was failure, unknown or other (the customer was unsure), customers were further categorized by month. During the months of March, September and October the subscription rate was 37.7%, which was not distinctive enough to claim that month alone was influential in deciding the subscription rate. For the rest of the months, however, we can clearly see that the chances of subscription were quite low (4.7%). March, September and October are the festive months in Portugal. So in order to determine the subscription chances during these months we have to look at other factors.

When previous outcome is failure, unknown or other, month is October, March, September and duration is less than 2.9 minutes VERSUS when duration is greater than 2.9 minutes but less than 8.7 minutes

As discussed above, other factors need to be considered for the months of March, September and October. Here the split was based on duration: the less time spent on the call, the lower the subscription rate. People who spent less than 2.9 minutes had a subscription rate of 17.6%, so we can say with confidence that subscription is unlikely. Those who spent more than 2.9 minutes had a subscription rate of 57.4%, but we should carefully consider other factors when making a decision. Even if more time was spent on the call in these months, we could not tell for sure whether the subscription would happen.

When previous outcome is failure, unknown or other, month is January, February, April, May, June, July, August, November, December and duration is less than 4.31 minutes VERSUS when duration is greater than 4.31 minutes but less than 8.7 minutes

People who were contacted during these months had a very low subscription rate of 4.1%, as observed before, and if they spent less than 4.31 minutes on the call their subscription rate decreased further to 2.6%. People who spent more than 4.31 minutes had a subscription rate of 11.0%, which is still low. The next rule shows how age could be a deciding factor.

When duration is greater than 4.31 minutes but less than 8.7 minutes, previous outcome is failure, unknown or other, month is January, February, April, May, June, July, August, November, December and age is less than or equal to 60.5 years VERSUS age is greater than 60.5 years

In the previous splitting rule we mentioned that age could be a deciding factor. People who spent more than 4.31 minutes but less than 8.7 minutes on the call and who were contacted during the months of January, February, April, May, June, July, August, November and December followed the same pattern of a low subscription rate if they were 60.5 years old or younger. For people older than 60.5 years this pattern no longer holds: they had fairly equal chances of subscribing or not subscribing.

When duration was greater than 8.70 minutes but less than 13.8 minutes and poutcome was success VERSUS when poutcome was unknown or failure

Customers under the node where duration is greater than 8.70 minutes but less than 13.8 minutes were further categorized based on the outcome of the previous marketing campaign. We could again predict that people who had subscribed previously with the bank and spent a good amount of time on the call have a higher chance of subscribing again. The decision tree confirms this: people with a successful previous marketing campaign had a subscription rate of 83.3%, versus a low subscription rate of 33.3% for people who didn't respond positively in the previous marketing campaign.

When duration is greater than or equal to 13.8 minutes and marital status was single, divorced VERSUS when marital status was married

The group of customers who spent a considerably long time (more than 13.8 minutes) on the call was further categorized based on marital status. Irrespective of marital status, people who spent the most time on the call had a decent subscription rate. People who were single or divorced had a subscription rate of 63.8%, and people who were married had a subscription rate of 54.4%. Given a choice, the bank should focus more on customers who are single or divorced than on married people.

When duration was greater than or equal to 13.8 minutes, marital status was married and contact type was cellular VERSUS when contact type was unknown

People who are married and spent more than 13.8 minutes on the call were further categorized based on the contact method. People who were easily reachable had a higher chance of subscription. For example, if the contact method was cellphone the subscription rate was 58%, whereas people who were contacted through postal mail or e-mail had a lower subscription rate of 44.5%. Even though the subscription rate for cellphone contact was not significantly high, it was greater than the rate for contact via mail and e-mail. So it is better to reach customers through cellphone or landline than by any other method, as it increases the chance of engaging customers in a human interaction.

Customers least likely to subscribe


Therefore the cases in which the decision tree identifies customers as least likely to subscribe are:

1. Customers who have previously subscribed to a CD and spend less than 2.2 minutes on the call (predictability of 78.5%).
2. Customers who are being contacted for the first time or did not subscribe during a previous marketing campaign, spend less than 4.3 minutes on the call and are contacted during non-festive months (predictability of 97.4%).
3. Customers who are contacted during non-festive months, either did not subscribe during a previous marketing campaign or are first-timers, spend between 4.3 and 8.7 minutes on the call and are younger than 60.5 years old (predictability of 89.9%).
4. Customers who are contacted during festive months, spend less than 2.9 minutes on the call and either didn't subscribe during a previous marketing campaign or were being contacted for the first time (predictability of 82.4%).

Customers most likely to subscribe


Similarly, the cases in which the decision tree identifies customers as most likely to subscribe are:

1. Customers who spend between 2.2 and 13.8 minutes on the call and had previously accepted a subscription offer (predictability of 73.3%).


Underlying patterns among variables


Now that we have identified the set of important variables and rules to be considered for the bank marketing campaign, we want to validate our analysis by fitting a logistic regression model to the data. Before we proceed, we take a closer look at the correlation among the independent variables. This is necessary to keep multicollinearity out of our model, since it can inflate the variance of the coefficient estimates and make them unstable. Interestingly enough, we didn't find any notable correlation among the numerical variables in our dataset (balance, duration, campaign and previous). The correlation and scatter plot matrix suggested only very mild association, which is of no concern in our analysis (Exhibit 4). Below is a tabular representation of the association among the variables:

[Table: pairwise association matrix for the numeric variables balance, duration, campaign and previous, with cells shaded according to the legend "Strong correlation" / "No correlation"]

We were more curious about any association between the categorical variables because they form the majority of our dataset. We carried out a chi-square test of association between each pair of categorical variables and looked at their Cramér's V statistics to identify any underlying relationship. We treat a Cramér's V of 0.25 or higher as strong association between two variables and anything below that as acceptable. The table below shows which of the categorical variables are strongly associated:
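The Cramér's V statistic used above can be computed directly from a contingency table. The following is a minimal Python sketch, not the SAS procedure used for the report; the function name `cramers_v` is our own, and the counts are the housing-by-quarter frequencies from Exhibit 5-e.

```python
import math

def cramers_v(table):
    """Return (chi_square, cramers_v) for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of the two variables.
            expected = row_totals[i] * col_totals[j] / n
            chi_square += (observed - expected) ** 2 / expected
    # V rescales chi-square by the sample size and the smaller table dimension.
    k = min(len(table), len(table[0])) - 1
    return chi_square, math.sqrt(chi_square / (n * k))

# Housing loan (rows: no, yes) by quarter of contact (columns: Q1 to Q4), Exhibit 5-e.
housing_by_quarter = [
    [2763, 5667, 9151, 2497],
    [1766, 16371, 4569, 2422],
]
chi_square, v = cramers_v(housing_by_quarter)
print(round(chi_square, 1), round(v, 4))
```

Up to rounding, this should agree with the chi-square (6466.4067) and Cramér's V (0.3782) reported for housing by month in Exhibit 5-e, which is above the 0.25 threshold.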


[Table: pairwise association matrix for the categorical variables Age, Job, Marital, Education, Default, Housing, Loan, Contact, Month, Pdays and Poutcome, with cells shaded where the legend "Strong correlation" applies]

Age and job


We found age and job to be correlated. This makes sense because as people grow older they get promoted and move to better job positions. For example, a student fresh out of college is more likely to be placed in a professional services or technical job, whereas someone who is middle-aged or around 60 is more likely to occupy a management position, be self-employed or be retired (Exhibit 5-a).

Age and marital status


Similarly age was also correlated with marital status, which makes even more sense because young people tend to be single more often than people in their 30s. And as they grow older they either remain married or get divorced (Exhibit 5-b).

Job and education


The higher a person's educational qualification, the higher the chances of finding a sophisticated job. A person with a Master's or PhD is more likely to hold a management position or a sophisticated technical job that requires a high degree of technical qualification, whereas someone with only basic school education is likely to end up in a blue-collar profession (Exhibit 5-c).

Job and housing


The kind of people who are prone to taking a housing loan can be explained by their job profile. Let's take it case by case. People who do not have a job or have very low income, such as students, housewives or retired employees, are very unlikely to take a housing loan because of apprehensions about repayment. Moreover, their background history may not be suitable enough for banks to extend such credit. Low-salaried people, on the other hand, such as blue-collar workers, clerks or people in services, actively seek better living standards and are therefore more likely to accept housing loans. Self-employed individuals and people in management jobs, who have a sufficient level of income, do not care much about housing loans because they can afford quality living standards themselves (Exhibit 5-d).

Housing and month


Our data shows that the majority of people were contacted during Q2 and Q3 of the year. Q2 had a strong focus on people who had a housing loan, whereas in Q3 people without a housing loan were targeted. Less effort went into reaching customers during Q1 and Q4, which are generally the festive seasons in Portugal. This explains the high degree of association between the two variables. As discussed before, the subscription rate was greater during the festive seasons, which was also supported by our decision tree analysis. We also notice that comparable focus was given to people without and with housing loans during Q4 (50.8% versus 49.2%) and, to a lesser extent, Q1 (61.0% versus 39.0%). This leads us to believe that housing loan probably isn't that important a factor (Exhibit 5-e).

Contact and month


Customers were contacted mostly through mail offers during Q2 and through cellphone during Q3. As noted before, less attention was paid to reaching customers in the other two quarters (Q1 and Q4) by any means. This explains the correlation, but we have no general explanation for why the communication type should vary with the season. Moreover, our decision tree analysis suggests that both contact and month are important variables. Therefore we will keep both in our analysis (Exhibit 5-f).

Pdays and poutcome


There is a very strong association between pdays and poutcome. New customers and people who didn't subscribe to a CD during a previous marketing campaign were the main focus of the campaign. A large portion of them were contacted within 6 months, and even more within a period of one year. Very few people were contacted after a year, with the least focus given to existing customers who had subscribed during the previous campaign. This is generally the case: marketers always try to lure new customers into buying their products, but they often don't focus on servicing existing customers, probably because they take them for granted (Exhibit 5-g).

Selection of relevant variables


We decided to keep job out of our analysis because of its high dependency on age, education and housing loan. Moreover, as the job variable has many categories, keeping it in the equation would risk overfitting our model. Since job profiles can be so diverse, we also need to leave room for new positions that may appear in the future. Age and education, which follow a more standard classification, serve the purpose better. We decided to drop housing loan as well because we didn't think it was a good enough predictor of subscription.

All remaining categorical variables are kept in our model. Correlations among the numeric variables were not noteworthy, so none of them were dropped. We ran general linear models across all numerical and categorical variables to check for association between them. Most showed minor correlation, but we did not drop any of them because all of these variables seemed relevant to the bank's marketing strategy. Because pdays and poutcome are highly associated, we run two logistic models, one without pdays and the other without poutcome.

Subscription Model (with poutcome)


We got satisfactory results from our logistic model using the selected set of independent variables discussed above (Exhibit 6). Model predictability was quite high at 88.94%, as given by the c-statistic. The convergence criterion was satisfied, so the model is interpretable, and the overall model was significant at the 0.01 level, indicating that the predictors jointly explain subscription better than an intercept-only model. There were of course a few outliers, some with high leverage and some poorly accounted for by the model. We removed them to prevent them from drastically altering our coefficient estimates. Influence diagnostics suggested that the estimates were quite stable after this cleansing.
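For readers unfamiliar with the c-statistic, the sketch below shows how it is built from pairs of one subscriber and one non-subscriber: a pair is concordant when the subscriber received the higher predicted probability. This is an illustrative Python sketch; the function name and the toy probabilities are hypothetical, and only the last line uses figures from Exhibit 6.

```python
def c_statistic(scores_event, scores_nonevent):
    """Share of (event, nonevent) pairs ranked correctly, counting ties as half."""
    concordant = discordant = tied = 0
    for p_event in scores_event:
        for p_nonevent in scores_nonevent:
            if p_event > p_nonevent:
                concordant += 1
            elif p_event < p_nonevent:
                discordant += 1
            else:
                tied += 1
    pairs = concordant + discordant + tied
    return (concordant + 0.5 * tied) / pairs

# Hypothetical predicted probabilities for three subscribers and four non-subscribers.
subscribed = [0.80, 0.55, 0.30]
not_subscribed = [0.10, 0.30, 0.45, 0.05]
c_toy = c_statistic(subscribed, not_subscribed)

# The same formula applied to the percentages in Exhibit 6
# (88.9% concordant, 0.0% tied) gives the reported c of 0.889.
c_exhibit6 = (88.9 + 0.5 * 0.0) / 100
print(c_toy, c_exhibit6)
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, so the c-statistic measures how well the model ranks customers rather than the share of correct classifications.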

Non-significant factors
As suggested by our decision tree results, the variables default and previous were not statistically significant predictors in our model. Default indicates whether the customer has defaulted on debts. Although not significant in our model, defaulting on loans affects a customer's credit history, which plays a crucial role in the financial services industry. Banks routinely check a person's payment history and background before extending mortgage loans, car loans and so on. Therefore we do not want to lose an important factor like credit history in a banking model. Previous, which counts the contacts made with a client before this campaign, narrowly missed significance (p-value = 0.0335, α = 0.01). Moreover, it makes sense to retain knowledge of how much effort has been spent on a particular customer, how familiar they are with the bank's products, whether they are new clients, and so on. Hence previous was kept in the model as well.

Testing our Hypothesis


Note: All odds ratio estimates are interpreted holding the other variables constant.
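Because the hypotheses are read off odds ratios, it helps to keep odds and probabilities apart: an odds ratio of 1.87 means 87% higher odds, which is a smaller relative change in probability unless the baseline rate is very low. The Python sketch below illustrates this; the function name and the 10% baseline are hypothetical and are not taken from the model.

```python
def probability_after_odds_ratio(baseline_probability, odds_ratio):
    """Apply an odds ratio to a baseline probability and return the new probability."""
    baseline_odds = baseline_probability / (1.0 - baseline_probability)
    new_odds = baseline_odds * odds_ratio
    return new_odds / (1.0 + new_odds)

# Hypothetical baseline: a reference-group customer with a 10% chance of subscribing.
baseline = 0.10
p_new = probability_after_odds_ratio(baseline, 1.87)
relative_increase = p_new / baseline - 1.0

print(round(p_new, 3), round(relative_increase, 3))
```

Under this assumed baseline the probability rises from 10% to roughly 17%, a relative increase of about 72% rather than 87%; the gap widens as the baseline probability grows.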

H1: Students or young professionals (18-27) and people on the verge of retiring (60+) have a higher chance of subscription.

Our hypothesis turns out to be correct. We hypothesized that young professionals, recent college graduates and people 60 years and above are more likely to subscribe to a CD. Age is a significant factor in our model, with a p-value less than 0.01. For our analysis we categorized customers into the age groups 18-27, 28-45, 46-59 and 60+. Since people aged 18-27 and 60+ were our main focus, we used 28-45 as the reference age group. The odds that people aged 18-27 will subscribe to a CD are 1.87 times the odds for people aged 28-45; that is, students or young professionals (18-27) have 87% higher odds of subscribing than people aged 28-45. The odds of subscription for people 60 years and above are 4 times the odds for people aged 28-45. Thus students or young professionals (18-27) and people on the verge of retiring (60+) should be the main target customers for the bank.

H2: Customers who are single are more likely to subscribe than customers who are married.

Marital status was a significant factor in our model. We compared the likelihood of subscription taking singles as our reference group. The odds that singles will take a subscription offer are 1.26 to 1.57 times the odds of married people taking the offer, that is, about 41% higher odds of subscription. Singles are often less settled in their lives than married couples and therefore look for other sources of income for stability.

H3: The chance of subscription increases with a higher degree of education.

We cannot say much about people who didn't disclose their highest educational qualification, but their odds of subscription are 1.5 times the odds of people with only primary education. Primary education in Portugal is mandatory and free, so it is safe to assume that people who didn't disclose their educational status have at least primary education. Among people with a known education level, people with secondary education, meaning those who have an undergraduate degree or equivalent qualification, have 31% higher odds of subscribing than people with basic primary education, and people who have attained tertiary education such as a Master's or PhD have about 81% higher odds than people with primary education. Therefore it is true that the chance of subscription increases with a higher degree of education.

H4: People who have a good credit history are good targets.

As discussed, default was not a statistically significant factor in our model; we included it because of its business significance. Model results suggest that the odds of people with a good credit history subscribing to a CD are 1.16 times the odds of people who have defaulted on their debts. However, the estimate could vary widely, between 0.765 and 1.76 (an interval that includes 1), indicating that people with a bad credit history can sometimes turn out to be potential customers. This is plausible because banks do not always turn down clients with a bad credit history. Some banks even extend offers to such clients, giving them a chance to improve their credit score. So it varies from case to case. All we can say is that our data doesn't contain sufficient evidence to validate our hypothesis about credit history.

H5: People with less financial liability, such as no personal loan, are more likely to subscribe.

There is statistically significant evidence that people who do not have a personal loan to pay back are more likely to subscribe. Our model suggests that the odds of subscription for people with no personal loan are 1.77 times the odds for people with a personal loan, that is, 77% higher odds. This is intuitive: people who have taken a personal loan are more likely to be concerned about paying it back, which means they have less money available to invest in a CD.

H6: People who are more reachable are more likely to subscribe.

Customers were contacted via various methods such as cellphone, landline and mail offers. As the odds ratios suggest, the highest chance of subscription came from people contacted via cellphone, followed by landline and then mail offers. People are more reachable through cellphones or landlines than through mail offers. When it comes to financial matters, customers usually prefer talking to a human to responding to targeted offline advertisements. Therefore we have statistically significant evidence that people who are more reachable are more likely to subscribe.

H7: Chances of subscription increase during the months of the festive season.

Q1 and Q4 are the main festive seasons in Portugal. The odds of customers taking a subscription offer are greater in Q1 than in any other quarter, with Q4 second highest. The odds for Q3 are close to those for Q4, indicating that these two quarters have the highest rates of subscription after Q1. However, our estimate for Q4 wasn't statistically significant at α = 0.01 (p-value = 0.0118). Moreover, our descriptive statistics suggested that the subscription rate was quite high in September, which falls in Q3, and in October and December, which fall in Q4. The statistical insignificance of Q4 could be due to a below-average subscription rate in November. Therefore Q1 is a definite target for bank representatives and, with some confidence, so are the last 4 months of the year. Further investigation is needed to find the reason for the low subscription rate in November.

H8: The more time spent on the call, the more likely customers are to subscribe. However, repeated calls reduce that chance.

Duration was measured in seconds spent talking to the customer. For ease of understanding we interpret the change in the odds of subscription per additional minutes spent on the call. The point estimate suggests that for every 10-minute increase in the time spent on the call, the odds that a customer takes the offer are about 10 times higher. This makes sense because the longer a bank representative talks to a customer, the more probable it is that the customer is interested in the product and therefore more likely to subscribe.
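The step from a per-second estimate to a per-10-minute statement is a rescaling of the odds ratio: an odds ratio for one unit is raised to the power of the number of units. The Python sketch below illustrates the arithmetic. It assumes the rounded per-second odds ratio of 1.004 that Exhibit 7 reports for duration in the pdays model, on the assumption that the poutcome model gives a similar value; the function name is our own.

```python
import math

def rescale_odds_ratio(odds_ratio_per_unit, units):
    """Odds ratio for a change of `units` given the odds ratio for a one-unit change."""
    # Equivalent to exp(units * beta), where beta = log(odds_ratio_per_unit).
    return math.exp(units * math.log(odds_ratio_per_unit))

per_second = 1.004  # rounded duration odds ratio (Exhibit 7); an assumption for this model
per_minute = rescale_odds_ratio(per_second, 60)
per_ten_minutes = rescale_odds_ratio(per_second, 600)

print(round(per_minute, 2), round(per_ten_minutes, 1))
```

With these inputs the 10-minute odds ratio comes out near 11, in line with the "about 10 times" reading. The result is sensitive to rounding of the per-second estimate, so the unrounded coefficient should be used when it is available.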


The variable campaign, which records the number of times a customer was called, has a negative coefficient. This indicates that repeated calls reduce a customer's chance of subscription. From the odds ratio we estimated that each repeat call lowers the odds of subscription by 8.5%. Therefore bank representatives should ideally spend longer explaining the deal on one call rather than calling customers repeatedly.

H9: Returning customers have a higher chance of subscription.

We included poutcome in our analysis to test this hypothesis. When poutcome is a failure, i.e. the customer did not subscribe during a previous marketing campaign, we cannot say whether the customer is more or less likely to take the offer now, because the estimate for the failure cases was not significant (p-value = 0.5371). However, customers who did accept a previous offer have 12 times the odds of subscribing to a new offer. Existing customers were presumably satisfied with a previous deal, which is why they subscribed in the first place. Hence chances are high that they will subscribe again. Targeting existing customers is therefore a good move and an essential customer retention strategy for the bank.

Subscription Model (with pdays)


H10: Subscription chances are higher if customers are contacted a year after a previous campaign

To test our last hypothesis we replaced poutcome with pdays and ran another logistic model, keeping all other variables intact (Exhibit 7). Predictability of our second model was 87.6%, approximately the same as our previous model. The coefficient estimates didn't change much, which is a good sign: because pdays is highly correlated with poutcome, replacing one with the other shouldn't change our model drastically.

We divided pdays into four categories: 1) people who were being contacted for the first time (NC), 2) people who were contacted within 6 months after the last campaign, 3) people who were contacted after 6 months but before one year, and 4) people who were contacted after one year.

From our odds ratios we found that people who were contacted after a year were most likely to subscribe, followed by people who were contacted within 6 months. For customers contacted between 6 and 12 months, the chance of subscription fell drastically. So the bank should reach out to customers after a year, at which point their odds of subscribing are about 2.5 times the odds for customers contacted between 6 and 12 months. Combining hypotheses H9 and H10, we can infer that repeated calls within a short period of time can increase customer frustration. People who have recently accepted or declined an offer are less likely to change their mind within a short span of time.
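The "2.5 times" comparison is not printed directly in Exhibit 7, because every pdays level is compared with the reference group NC. Two levels can still be compared with each other by dividing their odds ratios, since the common reference cancels. A minimal Python sketch of that arithmetic, using the point estimates from Exhibit 7 (the dictionary and function names are our own):

```python
# Point estimates from Exhibit 7, each relative to the reference level NC.
odds_ratio_vs_nc = {
    "1 year": 3.937,
    "6 months": 3.112,
    "6-12 months": 1.586,
}

def compare_levels(odds_ratios, level, other):
    """Odds ratio of `level` versus `other` when both are measured against one reference."""
    return odds_ratios[level] / odds_ratios[other]

after_year_vs_6_to_12 = compare_levels(odds_ratio_vs_nc, "1 year", "6-12 months")
print(round(after_year_vs_6_to_12, 2))
```

This gives only a point estimate. A confidence interval for the comparison would require refitting the model with 6-12 months as the reference level or using the covariance of the two coefficients.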


CONCLUSION
Recommended Marketing Strategy
Our goal was to analyze the historical data of a marketing campaign conducted by a Portuguese bank in order to identify the important indicator variables that could help us predict subscription chances. These indicator variables are to be used to devise a directed marketing strategy that targets only potential customers. The data, collected over a two-year period, showed statistically significant evidence in favor of a few key factors that are crucial to marketers in the financial services industry.

Marketers should pay special attention to existing customers who have accepted a bank offer previously. They should also focus on people aged 18-27, who are usually single, and on single or divorced individuals above the age of 60. Because these customers lack the support of a spouse, they tend to look for other means of financial stability and are hence more likely to subscribe. Highly educated people who have a good source of income form a good target audience. People with high salaries are less likely to take a loan because they have sufficient funds to cover their personal expenses and also enough disposable income to invest in banking products like CDs.

Marketers should focus on reaching these customers during the festive seasons (Q1) through cellphones and spend as much time as possible on the call. The more they keep customers engaged, the more likely customers are to take the offer. However, they should refrain from calling customers repeatedly and must ensure a gap of at least a year before contacting them again. Customer care calls have become so frequent in recent times that a high volume of calls can increase customer frustration and thereby reduce the chances of subscription.


EXHIBIT
Exhibit 1-a

Subscription (by Age category)


[Chart: vertical axis from 0.00% to 60.00%; age categories 18-27, 28-37, 38-47, 48-57, 58-67, 68-77, 78-87, 88-97]

Red line: the overall success rate 11.7%


Exhibit 1-b

Subscription (by Job category)


[Chart: vertical axis from 0.00% to 35.00%; one bar per job category]

Red line: the overall success rate 11.7%


Exhibit 1-c

Subscription (by Marital status)


[Chart: vertical axis from 0.00% to 16.00%; categories single, divorced, married]

Red line: the overall success rate 11.7%


Exhibit 1-d

Subscription (by Education level)


[Chart: vertical axis from 0.00% to 16.00%; categories tertiary, unknown, secondary, primary]

Red line: the overall success rate 11.7%


Exhibit 1-e

Default        People contacted   People subscribed   Subscription rate
no             44391              5235                11.79%
yes            815                52                  6.38%
Grand Total    45206              5287                11.70%

Exhibit 1-f

Subscription (by Balance amount)


[Chart: vertical axis from 0.00% to 100.00%; horizontal axis: balance amount]

Exhibit 1-g

Housing loan   People contacted   People subscribed   Subscription rate
no             20078              3353                16.70%
yes            25128              1934                7.70%
Grand Total    45206              5287                11.70%

Exhibit 1-h

Personal loan  People contacted   People subscribed   Subscription rate
no             37964              4804                12.65%
yes            7242               483                 6.67%
Grand Total    45206              5287                11.70%


Exhibit 1-i

Contact        People contacted   People subscribed   Subscription rate
cellular       29280              4367                14.91%
telephone      2906               390                 13.42%
unknown        13020              530                 4.07%
Grand Total    45206              5287                11.70%

Exhibit 1-j

Day            People contacted   People subscribed   Subscription rate
1-10           13724              1733                12.63%
11-20          18387              2024                11.01%
21-31          13095              1530                11.68%
Grand Total    45206              5287                11.70%

Exhibit 1-k

Subscription (by Month)


[Chart: vertical axis from 0.00% to 60.00%; months jan, feb, mar, apr, may, jun, jul, aug, sep, oct, nov, dec]


Exhibit 1-l

Subscription (by Duration)


[Chart: vertical axis from 0.00% to 70.00%; horizontal axis: contact duration (secs)]

Exhibit 1-m

Subscription (by No. of phone calls)


[Chart: vertical axis from 0.00% to 14.00%; horizontal axis: number of phone calls]


Exhibit 1-n

Subscription (by pdays)


[Chart: vertical axis from 0.00% to 60.00%; horizontal axis: pdays]

Exhibit 1-o

Subscription (by No. of previous phone calls)


[Chart: vertical axis from 0.00% to 70.00%; categories 0-9, 10-19, 20-29, 30-39, 40-49, 50-59, 270-279]


Exhibit 1-p

Subscription (by previous campaign outcome)


[Chart: vertical axis from 0.00% to 70.00%; categories success, other, failure, unknown]

Exhibit 2-a


Exhibit 2-b


Exhibit 2-c

Exhibit 3


Exhibit 4


Exhibit 5-a
descriptive analysis of successful cases
The FREQ Procedure
Table of age by job (shown with one row per job and one column per age group)
Each cell: Frequency / Percent / Row Pct (share of the age group) / Col Pct (share of the job); totals: Frequency / Percent

job             18-27                        28-45                          46-59                          60+                           Total
admin.          410 / 0.91 / 13.45 / 7.93    3366 / 7.45 / 12.15 / 65.11    1314 / 2.91 / 10.38 / 25.42    80 / 0.18 / 4.49 / 1.55       5170 / 11.44
blue-collar     627 / 1.39 / 20.56 / 6.44    6302 / 13.94 / 22.74 / 64.76   2705 / 5.98 / 21.36 / 27.79    98 / 0.22 / 5.50 / 1.01       9732 / 21.53
entrepreneur    53 / 0.12 / 1.74 / 3.56      897 / 1.98 / 3.24 / 60.32      511 / 1.13 / 4.04 / 34.36      26 / 0.06 / 1.46 / 1.75       1487 / 3.29
housemaid       25 / 0.06 / 0.82 / 2.02      547 / 1.21 / 1.97 / 44.11      567 / 1.25 / 4.48 / 45.73      101 / 0.22 / 5.66 / 8.15      1240 / 2.74
management      351 / 0.78 / 11.51 / 3.71    6303 / 13.94 / 22.75 / 66.66   2610 / 5.77 / 20.61 / 27.60    192 / 0.42 / 10.77 / 2.03     9456 / 20.92
retired         3 / 0.01 / 0.10 / 0.13       76 / 0.17 / 0.27 / 3.36        1095 / 2.42 / 8.65 / 48.39     1089 / 2.41 / 61.08 / 48.12   2263 / 5.01
self-employed   91 / 0.20 / 2.98 / 5.76      1014 / 2.24 / 3.66 / 64.22     433 / 0.96 / 3.42 / 27.42      41 / 0.09 / 2.30 / 2.60       1579 / 3.49
services        366 / 0.81 / 12.00 / 8.81    2747 / 6.08 / 9.91 / 66.13     1019 / 2.25 / 8.05 / 24.53     22 / 0.05 / 1.23 / 0.53       4154 / 9.19
student         593 / 1.31 / 19.45 / 63.22   342 / 0.76 / 1.23 / 36.46      3 / 0.01 / 0.02 / 0.32         0 / 0.00 / 0.00 / 0.00        938 / 2.07
technician      436 / 0.96 / 14.30 / 5.74    5218 / 11.54 / 18.83 / 68.69   1854 / 4.10 / 14.64 / 24.41    88 / 0.19 / 4.94 / 1.16       7596 / 16.80
unemployed      85 / 0.19 / 2.79 / 6.52      791 / 1.75 / 2.85 / 60.71      409 / 0.90 / 3.23 / 31.39      18 / 0.04 / 1.01 / 1.38       1303 / 2.88
unknown         9 / 0.02 / 0.30 / 3.13       107 / 0.24 / 0.39 / 37.15      144 / 0.32 / 1.14 / 50.00      28 / 0.06 / 1.57 / 9.72       288 / 0.64
Total           3049 / 6.74                  27710 / 61.30                  12664 / 28.01                  1783 / 3.94                   45206 / 100.00

Statistics for Table of age by job

Statistic                      DF   Value        Prob
Chi-Square                     33   19298.9919   <.0001
Likelihood Ratio Chi-Square    33   10285.2120   <.0001
Mantel-Haenszel Chi-Square     1    19.7392      <.0001
Phi Coefficient                     0.6534
Contingency Coefficient             0.5470
Cramer's V                          0.3772

Sample Size = 45206


Exhibit 5-b
Descriptive analysis of successful cases
The FREQ Procedure
Table of age by marital
Each cell: Frequency / Percent / Row Pct / Col Pct; totals: Frequency / Percent

age      divorced                       married                          single                         Total
18-27    45 / 0.10 / 1.48 / 0.86        598 / 1.32 / 19.61 / 2.20        2406 / 5.32 / 78.91 / 18.81    3049 / 6.74
28-45    2618 / 5.79 / 9.45 / 50.28     15770 / 34.88 / 56.91 / 57.95    9322 / 20.62 / 33.64 / 72.90   27710 / 61.30
46-59    2241 / 4.96 / 17.70 / 43.04    9424 / 20.85 / 74.42 / 34.63     999 / 2.21 / 7.89 / 7.81       12664 / 28.01
60+      303 / 0.67 / 16.99 / 5.82      1419 / 3.14 / 79.58 / 5.21       61 / 0.13 / 3.42 / 0.48        1783 / 3.94
Total    5207 / 11.52                   27211 / 60.19                    12788 / 28.29                  45206 / 100.00

Statistics for Table of age by marital

Statistic                      DF   Value        Prob
Chi-Square                     6    7552.3473    <.0001
Likelihood Ratio Chi-Square    6    7976.6195    <.0001
Mantel-Haenszel Chi-Square     1    5641.4364    <.0001
Phi Coefficient                     0.4087
Contingency Coefficient             0.3784
Cramer's V                          0.2890

Exhibit 5-c
Statistic                      DF   Value        Prob
Chi-Square                     33   28483.3522   <.0001
Likelihood Ratio Chi-Square    33   27753.5813   <.0001
Mantel-Haenszel Chi-Square     1    1256.4379    <.0001
Phi Coefficient                     0.7938
Contingency Coefficient             0.6217
Cramer's V                          0.4583

Exhibit 5-d

Statistic                      DF   Value        Prob
Chi-Square                     11   3590.2589    <.0001
Likelihood Ratio Chi-Square    11   3715.7033    <.0001
Mantel-Haenszel Chi-Square     1    711.3738     <.0001
Phi Coefficient                     0.2818
Contingency Coefficient             0.2712
Cramer's V                          0.2818

Exhibit 5-e
descriptive analysis of successful cases
The FREQ Procedure
Table of housing by month
Each cell: Frequency / Percent / Row Pct / Col Pct; totals: Frequency / Percent

housing   Q1                            Q2                              Q3                             Q4                            Total
no        2763 / 6.11 / 13.76 / 61.01   5667 / 12.54 / 28.22 / 25.71    9151 / 20.24 / 45.58 / 66.70   2497 / 5.52 / 12.44 / 50.76   20078 / 44.41
yes       1766 / 3.91 / 7.03 / 38.99    16371 / 36.21 / 65.15 / 74.29   4569 / 10.11 / 18.18 / 33.30   2422 / 5.36 / 9.64 / 49.24    25128 / 55.59
Total     4529 / 10.02                  22038 / 48.75                   13720 / 30.35                  4919 / 10.88                  45206 / 100.00

Statistics for Table of housing by month

Statistic                      DF   Value        Prob
Chi-Square                     3    6466.4067    <.0001
Likelihood Ratio Chi-Square    3    6642.7065    <.0001
Mantel-Haenszel Chi-Square     1    1162.6843    <.0001
Phi Coefficient                     0.3782
Contingency Coefficient             0.3538
Cramer's V                          0.3782


Exhibit 5-f
descriptive analysis of successful cases
The FREQ Procedure
Table of contact by month
Each cell: Frequency / Percent / Row Pct / Col Pct; totals: Frequency / Percent

contact     Q1                            Q2                              Q3                              Q4                            Total
cellular    4044 / 8.95 / 13.81 / 89.29   8786 / 19.44 / 30.01 / 39.87    12182 / 26.95 / 41.61 / 88.79   4268 / 9.44 / 14.58 / 86.77   29280 / 64.77
telephone   456 / 1.01 / 15.69 / 10.07    739 / 1.63 / 25.43 / 3.35       1164 / 2.57 / 40.06 / 8.48      547 / 1.21 / 18.82 / 11.12    2906 / 6.43
unknown     29 / 0.06 / 0.22 / 0.64       12513 / 27.68 / 96.11 / 56.78   374 / 0.83 / 2.87 / 2.73        104 / 0.23 / 0.80 / 2.11      13020 / 28.80
Total       4529 / 10.02                  22038 / 48.75                   13720 / 30.35                   4919 / 10.88                  45206 / 100.00

Statistics for Table of contact by month

Statistic                      DF   Value        Prob
Chi-Square                     6    16487.9794   <.0001
Likelihood Ratio Chi-Square    6    19401.6568   <.0001
Mantel-Haenszel Chi-Square     1    3567.4119    <.0001
Phi Coefficient                     0.6039
Contingency Coefficient             0.5170
Cramer's V                          0.4270

Exhibit 5-g
descriptive analysis of successful cases
The FREQ Procedure
Table of pdays by poutcome
Each cell: Frequency / Percent / Row Pct / Col Pct; totals: Frequency / Percent

pdays         failure                        other                          success                       unknown                            Total
1 year        428 / 0.95 / 66.77 / 8.73      137 / 0.30 / 21.37 / 7.45      76 / 0.17 / 11.86 / 5.03      0 / 0.00 / 0.00 / 0.00             641 / 1.42
6 months      1680 / 3.72 / 52.57 / 34.28    684 / 1.51 / 21.40 / 37.17     832 / 1.84 / 26.03 / 55.06    0 / 0.00 / 0.00 / 0.00             3196 / 7.07
6-12 months   2793 / 6.18 / 63.26 / 56.99    1019 / 2.25 / 23.08 / 55.38    603 / 1.33 / 13.66 / 39.91    0 / 0.00 / 0.00 / 0.00             4415 / 9.77
NC            0 / 0.00 / 0.00 / 0.00         0 / 0.00 / 0.00 / 0.00         0 / 0.00 / 0.00 / 0.00        36954 / 81.75 / 100.00 / 100.00    36954 / 81.75
Total         4901 / 10.84                   1840 / 4.07                    1511 / 3.34                   36954 / 81.75                      45206 / 100.00

Statistics for Table of pdays by poutcome

Statistic                      DF   Value        Prob
Chi-Square                     9    46386.7985   <.0001
Likelihood Ratio Chi-Square    9    43177.3094   <.0001
Mantel-Haenszel Chi-Square     1    32433.1429   <.0001
Phi Coefficient                     1.0130
Contingency Coefficient             0.7117
Cramer's V                          0.5848

Exhibit 6
logistic regression
The LOGISTIC Procedure

Model Information
Data Set                       MYSAS.BANK_RECODED
Response Variable              subscription    subscription
Number of Response Levels      2
Model                          binary logit
Optimization Technique         Fisher's scoring

Number of Observations Read    45201
Number of Observations Used    45201

Response Profile
Ordered Value   subscription   Total Frequency
1               0              39916
2               1              5285

Probability modeled is subscription=1.

Class Level Information

Class       Value       Design Variables
age         18-27       1  0  0
            28-45       0  0  0
            46-59       0  1  0
            60+         0  0  1
marital     divorced    1  0
            married     0  1
            single      0  0
education   primary     0  0  0
            secondary   1  0  0
            tertiary    0  1  0
            unknown     0  0  1
default     no          0
            yes         1
loan        no          0
            yes         1
contact     cellular    1  0
            telephone   0  1
            unknown     0  0
month       Q1          0  0  0
            Q2          1  0  0
            Q3          0  1  0
            Q4          0  0  1
poutcome    failure     1  0  0
            other       0  1  0
            success     0  0  1
            unknown     0  0  0

Model Fit Statistics
Criterion   Intercept Only   Intercept and Covariates
AIC         32614.294        22833.093
SC          32623.013        23033.627
-2 Log L    32612.294        22787.093

Testing Global Null Hypothesis: BETA=0
Test               Chi-Square    DF   Pr > ChiSq
Likelihood Ratio   9825.2014     22   <.0001
Score              12333.6287    22   <.0001
Wald               6352.2152     22   <.0001

Type 3 Analysis of Effects
Effect      DF   Wald Chi-Square   Pr > ChiSq
age         3    473.8791          <.0001
marital     2    66.0424           <.0001
education   3    114.9996          <.0001
default     1    0.8578            0.3544
balance     1    17.6666           <.0001
loan        1    96.7177           <.0001
contact     2    478.0778          <.0001
month       3    42.3857           <.0001
duration    1    4119.3882         <.0001
campaign    1    83.0268           <.0001
previous    1    4.5211            0.0335
poutcome    3    1449.6631         <.0001


Association of Predicted Probabilities and Observed Responses
Percent Concordant   88.9         Somers' D   0.779
Percent Discordant   11.1         Gamma       0.779
Percent Tied         0.0          Tau-a       0.161
Pairs                210956060    c           0.889


Partition for the Hosmer and Lemeshow Test
                  subscription = 1         subscription = 0
Group   Total     Observed   Expected      Observed   Expected
1       4520      8          35.98         4512       4484.02
2       4520      21         68.61         4499       4451.39
3       4520      31         108.50        4489       4411.50
4       4520      61         156.42        4459       4363.58
5       4522      104        209.42        4418       4312.58
6       4520      212        271.99        4308       4248.01
7       4520      369        360.62        4151       4159.38
8       4520      672        521.26        3848       3998.74
9       4520      1246       917.92        3274       3602.08
10      4519      2561       2634.43       1958       1884.57

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square   DF   Pr > ChiSq
443.7519          <.0001
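The Hosmer and Lemeshow chi-square can be recomputed from the partition table: each group contributes the squared difference between observed and expected counts, divided by the expected count, for both subscribers and non-subscribers. The Python sketch below is an illustration using the Exhibit 6 partition, not the SAS implementation; the function name is our own.

```python
def hosmer_lemeshow(groups):
    """Sum of (observed - expected)^2 / expected over events and nonevents in each group.

    Each group is a tuple (total, observed_events, expected_events).
    """
    statistic = 0.0
    for total, observed_events, expected_events in groups:
        observed_nonevents = total - observed_events
        expected_nonevents = total - expected_events
        statistic += (observed_events - expected_events) ** 2 / expected_events
        statistic += (observed_nonevents - expected_nonevents) ** 2 / expected_nonevents
    return statistic

# (group total, observed subscribers, expected subscribers) from Exhibit 6.
exhibit6_partition = [
    (4520, 8, 35.98), (4520, 21, 68.61), (4520, 31, 108.50), (4520, 61, 156.42),
    (4522, 104, 209.42), (4520, 212, 271.99), (4520, 369, 360.62), (4520, 672, 521.26),
    (4520, 1246, 917.92), (4519, 2561, 2634.43),
]
hl_statistic = hosmer_lemeshow(exhibit6_partition)
print(round(hl_statistic, 2))
```

Up to rounding of the expected counts, this should reproduce the reported chi-square of 443.7519. The test's null hypothesis is that predicted and observed rates agree within the groups, so the small p-value here (<.0001) points to a calibration gap: the model over-predicts subscriptions in the lowest groups and under-predicts them in groups 8 and 9.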

Classification Table
              Correct               Incorrect             Percentages
Prob Level    Event    NonEvent     Event    NonEvent     Correct   Sensitivity   Specificity   False POS   False NEG
0.000         5285     0            39916    0            11.7      100.0         0.0           88.3        .
0.100         4363     31545        8371     922          79.4      82.6          79.0          65.7        2.8
0.200         3233     36567        3349     2052         88.1      61.2          91.6          50.9        5.3
0.300         2564     37945        1971     2721         89.6      48.5          95.1          43.5        6.7
0.400         2113     38592        1324     3172         90.1      40.0          96.7          38.5        7.6
0.500         1716     38997        919      3569         90.1      32.5          97.7          34.9        8.4
0.600         1282     39274        642      4003         89.7      24.3          98.4          33.4        9.2
0.700         914      39494        422      4371         89.4      17.3          98.9          31.6        10.0
0.800         587      39648        268      4698         89.0      11.1          99.3          31.3        10.6
0.900         269      39785        131      5016         88.6      5.1           99.7          32.8        11.2
1.000         0        39916        0        5285         88.3      0.0           100.0         .           11.7
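The percentage columns of the classification table follow from the four counts in each row. The Python sketch below recomputes them for the 0.500 cutoff of Exhibit 6 (1716 correct events, 38997 correct nonevents, 919 incorrect events, 3569 incorrect nonevents). The function and key names are our own, and the definitions of False POS and False NEG are the ones that reproduce the SAS columns: shares of the predicted events and predicted nonevents that are wrong.

```python
def classification_rates(correct_event, correct_nonevent, incorrect_event, incorrect_nonevent):
    """Percentages for one cutoff of a classification table.

    incorrect_event: nonevents predicted as events.
    incorrect_nonevent: events predicted as nonevents.
    """
    total = correct_event + correct_nonevent + incorrect_event + incorrect_nonevent
    return {
        "correct": 100.0 * (correct_event + correct_nonevent) / total,
        "sensitivity": 100.0 * correct_event / (correct_event + incorrect_nonevent),
        "specificity": 100.0 * correct_nonevent / (correct_nonevent + incorrect_event),
        # Share of predicted events that were actually nonevents.
        "false_pos": 100.0 * incorrect_event / (correct_event + incorrect_event),
        # Share of predicted nonevents that were actually events.
        "false_neg": 100.0 * incorrect_nonevent / (correct_nonevent + incorrect_nonevent),
    }

rates = classification_rates(1716, 38997, 919, 3569)
for name, value in rates.items():
    print(name, round(value, 1))
```

At this cutoff the model classifies about 90% of customers correctly but identifies only about a third of the actual subscribers, which is why the overall accuracy should be read alongside sensitivity when only 11.7% of customers subscribe.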


Exhibit 7
logistic regression
The LOGISTIC Procedure Model Information Data Set Response Variable Number of Response Levels Model Optimization Technique MYSAS.BANK_RECODED subscription 2 binary logit Fisher's scoring subscription

Number of Observations Read    45201
Number of Observations Used    45201

Response Profile

Ordered Value    subscription    Total Frequency
1                0                         39916
2                1                          5285

Probability modeled is subscription=1.

Class Level Information

Class        Value          Design Variables
age          18-27          1  0  0
             28-45          0  0  0
             46-59          0  1  0
             60+            0  0  1
marital      divorced       1  0
             married        0  1
             single         0  0
education    primary        0  0  0
             secondary      1  0  0
             tertiary       0  1  0
             unknown        0  0  1
default      no             0
             yes            1
loan         no             0
             yes            1
contact      cellular       1  0
             telephone      0  1
             unknown        0  0
month        Q1             0  0  0
             Q2             1  0  0
             Q3             0  1  0
             Q4             0  0  1
pdays        1 year         1  0  0
             6 months       0  1  0
             6-12 months    0  0  1
             NC             0  0  0


Odds Ratio Estimates

                                      Point         99% Wald
Effect                                Estimate      Confidence Limits
age 18-27 vs 28-45                    2.006         1.719    2.341
age 46-59 vs 28-45                    1.102         0.983    1.234
age 60+ vs 28-45                      4.511         3.791    5.367
marital divorced vs single            0.830         0.708    0.974
marital married vs single             0.706         0.634    0.787
education secondary vs primary        1.353         1.165    1.571
education tertiary vs primary         1.893         1.619    2.213
education unknown vs primary          1.687         1.324    2.151
default yes vs no                     0.828         0.546    1.254
balance                               1.000         1.000    1.000
loan yes vs no                        0.522         0.451    0.606
contact cellular vs unknown           3.685         3.131    4.336
contact telephone vs unknown          3.099         2.454    3.914
month Q2 vs Q1                        0.763         0.663    0.878
month Q3 vs Q1                        0.735         0.637    0.848
month Q4 vs Q1                        0.808         0.686    0.951
duration                              1.004         1.004    1.004
campaign                              0.907         0.884    0.930
previous                              1.018         0.995    1.042
pdays 1 year vs NC                    3.937         3.012    5.146
pdays 6 months vs NC                  3.112         2.672    3.624
pdays 6-12 months vs NC               1.586         1.362    1.848
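Each odds ratio is the exponential of a logit coefficient, and the Wald limits are symmetric on the log scale, so the coefficient and an approximate standard error can be recovered from a table row. The sketch below does this for a few rows and flags whether the 99% interval excludes 1. The function name is illustrative, and the standard errors are approximate because the table is rounded to three decimals.

```python
import math
from statistics import NormalDist

# Two-sided 99% critical value of the standard normal, about 2.576
Z_99 = NormalDist().inv_cdf(0.995)

def wald_summary(point, lower, upper):
    """Return (log-odds coefficient, approximate SE, interval excludes 1)."""
    beta = math.log(point)
    se = (math.log(upper) - math.log(lower)) / (2 * Z_99)
    excludes_one = not (lower <= 1.0 <= upper)
    return beta, se, excludes_one

# (point estimate, lower limit, upper limit) copied from the odds ratio table
rows = {
    "age 60+ vs 28-45": (4.511, 3.791, 5.367),
    "age 46-59 vs 28-45": (1.102, 0.983, 1.234),
    "default yes vs no": (0.828, 0.546, 1.254),
    "loan yes vs no": (0.522, 0.451, 0.606),
}
results = {name: wald_summary(*vals) for name, vals in rows.items()}
for name, (beta, se, sig) in results.items():
    print(f"{name}: beta={beta:.3f}, se={se:.3f}, excludes 1: {sig}")
```

This check is not reliable for per-unit effects of continuous variables such as balance and duration, whose estimate and limits are indistinguishable at three decimals.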

Association of Predicted Probabilities and Observed Responses

Percent Concordant         87.6    Somers' D    0.753
Percent Discordant         12.4    Gamma        0.753
Percent Tied                0.0    Tau-a        0.155
Pairs                 210956060    c            0.876


Partition for the Hosmer and Lemeshow Test

                   subscription = 1         subscription = 0
Group    Total    Observed    Expected     Observed    Expected
1         4520           5       34.29         4515     4485.71
2         4520          22       67.24         4498     4452.76
3         4520          31      107.16         4489     4412.84
4         4520          66      158.74         4454     4361.26
5         4520         136      218.02         4384     4301.98
6         4520         232      293.10         4288     4226.90
7         4521         440      406.52         4081     4114.48
8         4521         716      605.77         3805     3915.23
9         4520        1264      997.32         3256     3522.68
10        4519        2373     2396.82         2146     2122.18

Hosmer and Lemeshow Goodness-of-Fit Test

Chi-Square    331.9394
DF
Pr > ChiSq    <.0001


REFERENCE
Wikipedia. "Retirement Age." Wikipedia. Wikimedia Foundation, 5 Dec. 2013. Web. <http://en.wikipedia.org/wiki/Retirement_age>.

European Union. "Living in Portugal - Educational Systems." AngloINFO Portugal. Everything for Expats Living in or Moving to Portugal. AngloInfo, Aug. 2010. Web. <http://portugal.angloinfo.com/family/eu-factsheets-family/educational-system-eu/>.

Portugal Tourist Info. "Portugal Festivals." PortugalVisitor. PortugalVisitor, n.d. Web. <http://www.portugalvisitor.com/portugal-culture/portugal-festivals>.

Utoronto. "Crosstabulation with Nominal Variables." Groups.chass.utoronto.ca. Utoronto, n.d. Web. <http://groups.chass.utoronto.ca/pol242/Labs/LM-3A/LM-3A_content.htm>.
