
Predicting Extramarital Affairs

With Machine Learning Models


Texas Woman’s University

Alwin Bethel
Sheila Tupker
Yashada Pillai

Abstract. When almost a fourth of married couples confess to having engaged in extramarital affairs, this raises the question: what exactly determines the likelihood of someone having an affair? To answer this question, a data set containing various attributes of individuals was used to examine and run the machine learning models. The models employed in this study include Decision Trees, Random Forest, and Logistic Regression. We found that, for this particular dataset, we had the most success with the Random Forest model using our top four predictor variables: age, rate of happiness, religiosity, and education level. The Random Forest model indicated a relatively limited level of overfitting and held the least variation amongst the training score, testing score, and f-score.

Keywords: infidelity, extramarital, affairs, big data, predictive analytics, Decision Trees, Random Forest, Logistic Regression

1 Problem
Our search for an interesting and manageable project led us to choose extramarital affair data, both for its significance and its feasibility. Nearly a fourth of the population of married individuals in the United
States admits to engaging in at least one act of sexual infidelity while married [1]. This
target variable of extramarital involvement has corresponding attributes that are
possibly indicative of certain innate characteristics of the unfaithful partner. These
characteristics can range from the level of education to the degree of religiosity of said
individuals. This project seeks to aggregate these demographic attributes and
structure an informed prediction of the characteristics associated with extramarital affairs.
Moreover, the project systematically employs machine learning techniques through Decision Tree, Random Forest, and Logistic Regression modeling. The intention behind predictive modeling is the ability to determine, by means of common characteristics of adulterous individuals, whether or not a given individual is likely to engage in extramarital involvement. These “at-risk” individuals will be referred to, for the
remainder of this report, as statistically vulnerable individuals (SVIs). Our initial
questions are as follows: Given this particular data set, can we predict whether or not
someone will have an affair? Based on our results, are there identifiable attributes
among those more likely to have an affair? In a practical sense, answering these questions concerns the 81 million individuals (and their spouses and families) who are affected by infidelity in a marriage [1]. Answering these questions
may not prevent a potential spouse from entering into a relationship with an SVI or
prevent an SVI from being unfaithful, but at the least a married individual can learn
from our study and be more informed about the potentiality of extramarital activity.

2 Dataset
To download the dataset, go to the following URL, scroll to the Affairs section of the page, and click the text below the graphical output that states “you can download the data here”: http://koaning.io/fun-datasets.html. The dataset contains 601 rows and 9 original columns (10 columns after a programmatic addition). We have created a data dictionary, Table 1, which describes our dataset and can serve as a reference for the methodologies in our research study.
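
As a minimal sketch of this step (assuming the downloaded file is saved locally as affairs.csv; the actual filename may differ), the dataset can be loaded and the tenth column added as follows:

import pandas as pd

# Assumed local filename; the file downloaded from the URL above may be named differently.
data = pd.read_csv('affairs.csv')
print(data.shape)  # expected: (601, 9) before the added column

# The affair column is not in the original data; it is derived from nbaffairs (see Table 1).
data['affair'] = (data['nbaffairs'] > 0).astype(int)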

3 Methodology
The following information comes from our initial proposal and is included for referential purposes only. Goal: Our goal is to learn from the data set whether
someone is more likely to have an affair, and to develop an appropriate prediction
outcome. Methodology: We will use the Decision Tree algorithm to solve our
question.
We followed a loosely scientific approach for this project, using a step-by-step, trial-and-error methodology. Although we didn’t formulate an explicit hypothesis, we used the initial questions posed in our proposal as a guide. We resolved to experiment
with what would work, what wouldn’t work, and why. We continued finding solutions
and improving our model after answering our initial questions. Initially, we answered
our questions with a Decision Tree model, and continued to advance our research with
Random Forest and Logistic Regression models, as detailed below. Consequently, our
two source files (BigDataProject.ipynb and SecondAttemptBigDataProject.ipynb)
contain duplicate code for the data munging portions of each file. This is because we
wanted each source file to stand on its own as a complete set of code, which can run
the entire actualized set of models for this research project.
3.1 Details of Python Source Files

The initial process began with seeking out data that would correspond to a topic of
interest, choosing a predictive model, and applying Python code to execute it (see
beginning of methodology section for information from our initial proposal). The
process of executing our code began with the importing of various packages (and
corresponding models) that we deemed necessary for this process. These packages
include, but are not limited to: NumPy, Pandas, SciPy, sklearn, Matplotlib,
and os. After importing our dataset, we briefly examined it using functions like data.info() and data.describe(), among others, to analyze our data. We also
proceeded with a distribution analysis (using histograms) to determine the variables in
our dataset’s form, center, and spread, and establish outliers or skewness. Within our
distribution analysis, in cell 13, we checked for null values, of which there were none.
From there, with the categorical variables, we examined the values within each class
for validity as well as frequency. This is particularly important for our target variable
because an imbalance in the classes can skew and bias the results of our Logistic Regression model, leading to misleading accuracy scores. Then we
checked to make sure that the ordinal and nominal variables were encoded differently,
which they were. An important note - for ordinal variables, order must be maintained,
while for nominal variables order doesn’t matter, so we can’t assign the same ordering
that we had used for our ordinal variables. Thus, we would usually use one hot
encoding for nominal variables, but we did not need to in this case. After this, we applied scikit-learn’s LabelEncoder to transform binary variables like sex and presence of children into corresponding numerical values. From there, we employed a
categorical variable analysis, starting in cell 46, which distinguishes probability trends
amongst different variables. Specifically, we wanted to know the probability of having
an affair based on other characteristics. This process was repeated for every predictor variable that could potentially have an impact on extramarital activity.
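
The following is a condensed sketch of these inspection and encoding steps, using the column names from the data dictionary; the original notebook cells may differ in detail:

import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder

data.info()                  # column types and non-null counts
print(data.describe())       # center and spread of the numeric columns
data.hist(figsize=(10, 8))   # distribution analysis via histograms
plt.show()
print(data.isnull().sum())   # null check; this dataset contains none

# Transform the binary nominal variables into corresponding numerical values.
le = LabelEncoder()
for col in ['sex', 'child']:
    data[col] = le.fit_transform(data[col])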
For our first attempt, we defined a function that would run different models
using k-fold cross-validation to divide our training and testing data. For our second
attempt, we used train_test_split from sklearn to ensure an 80% training
and a 20% testing set. We continued on using this 80/20 split in the application of our
models. Next, we employed an over-sampling of our training data to handle the
imbalance of frequency in our target variable, ‘affair’. This was accomplished by
using the SMOTE package, imported from imblearn.over_sampling. Before
applying our models in our second attempt, we used Random Forest modeling to find
the importance of each feature and administered our findings for feature selection. In
this process, initially we found and ranked the importance of every predictive variable.
We proceeded to run our Random Forest model using the top one feature, then the top
two features, all the way through the top eight features [cell 989]. We discovered that
using the top four features (age, religiosity, education, and rate of happiness) provides
us the highest f-score, which is a measure of the “harmonic mean of precision and
recall” [5]. For our further analysis we ran our models with the top eight features and
the top four to discover differences and accuracies.
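
A sketch of the splitting, over-sampling, and feature-selection loop described above follows; the variable names are illustrative, and the random seeds are assumptions for reproducibility, as the report does not state the values used:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from imblearn.over_sampling import SMOTE

predictors = ['sex', 'age', 'ym', 'child', 'religious', 'education', 'occupation', 'rate']
X, y = data[predictors], data['affair']

# 80% training / 20% testing split.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Over-sample only the training data so the test set keeps its natural class balance.
X_train_res, y_train_res = SMOTE(random_state=0).fit_resample(X_train, y_train)
X_train_res = pd.DataFrame(X_train_res, columns=predictors)  # older imblearn returns arrays

# Rank every feature by importance, then re-run with the top 1, 2, ..., 8 features.
rf = RandomForestClassifier(random_state=0).fit(X_train_res, y_train_res)
ranked = pd.Series(rf.feature_importances_, index=predictors).sort_values(ascending=False)
for k in range(1, len(ranked) + 1):
    top_k = list(ranked.index[:k])
    model = RandomForestClassifier(random_state=0).fit(X_train_res[top_k], y_train_res)
    print(k, 'features, f-score: %.3f' % f1_score(y_test, model.predict(X_test[top_k])))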

3.2 Phase 1: Methodology of Decision Tree Modeling

The following corresponds to our first attempt Python file (BigDataProject.ipynb). In our first attempt we implemented a Decision Tree model. We used a few learning algorithms in our project to solve a classification problem, where we classified the data in a manner that predicted whether or not variables had an impact on affair.
Essentially, we defined an algorithmic function called classification_model
that contains the parameters model, data, predictors, and outcome. The
specific model we administered was a Decision Tree which was incorporated into the
model parameter. This line of code, coupled with fit(), was the application of a learning segment within our algorithm and is referred to as “fitting the model”. In this case, the parameters for our learning segment were predictors and outcome.
Predictors (listed in variable form), specifically, were sex, age, ym, child,
religious, education, occupation, and rate. Outcome is manifested as no
affair or affair, with corresponding output 0 or 1, respectively. This overall process is
covered within the comments in cell 54 and visualized in Figure 1. From this point, we
applied the algorithm to the affair dataset. The output includes an accuracy percentage
and a cross-validation score, which helps us assess how our results will generalize to
an independent data set. Finally, in cell 67, a Logistic Regression model was applied to our predictor variables and outcome variable. To better understand the holistic process, see Figure 1.
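
The following sketch is consistent with this description, reusing data and predictors from the earlier sketches; the exact body of cell 54 may differ, and the five-fold setting is an assumption:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

def classification_model(model, data, predictors, outcome):
    # Fit the model on the predictor and outcome columns ("fitting the model").
    model.fit(data[predictors], data[outcome])
    accuracy = model.score(data[predictors], data[outcome])
    print('Accuracy: %.3f%%' % (accuracy * 100))
    # Cross-validation estimates how the results generalize to an independent data set.
    cv = cross_val_score(model, data[predictors], data[outcome],
                         cv=KFold(n_splits=5, shuffle=True, random_state=0))
    print('Cross-validation: %.3f%%' % (cv.mean() * 100))

classification_model(DecisionTreeClassifier(), data, predictors, 'affair')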

3.3 Phase 2: Methodology of Random Forest

The following portion of methodology corresponds to our second attempt Python file
(SecondAttemptBigDataProject.ipynb). As mentioned earlier, we took the approach of applying the scientific method to this problem, and consequently proceeded to find a potentially better solution to our questions. A Random Forest model takes Decision
Tree modeling to the next level by producing a collection of Decision Trees and
averaging the outcome of every Decision Tree to get the accuracy for our Random
Forest model. For this second iteration of our project, we used the
RandomForestClassifier from the sklearn package. We also eliminated the
nbaffairs variable as it was redundant, in that we based the affair column on this
field (cell 969). We trained the model using the fit() function and used the classifier’s feature_importances_ attribute to narrow down which features were important to the model (as opposed to finding this insight based solely on categorical variable analysis). The
purpose of feature selection is to evaluate the most significant features by selecting the top n features, running the Random Forest algorithm, and repeating the process with the top n+1 features. Essentially, this process provides insight into which features are the most relevant. After feature selection we ran a Random Forest with our newly discovered top four features (age, religiosity, education level, and rate of happiness), as well as a run with all features. Additionally, a cross-validation score was calculated to
check for overfitting.
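
A minimal sketch of this comparison, reusing the split and resampled training data from the sketch in Section 3.1 (the variable names are illustrative, not the notebook's):

from sklearn.ensemble import RandomForestClassifier

top_four = ['age', 'religious', 'education', 'rate']

# Train on the over-sampled training set; evaluate on the untouched test set.
for feats in [predictors, top_four]:
    rf = RandomForestClassifier(random_state=0).fit(X_train_res[feats], y_train_res)
    print(len(feats), 'features -- train: %.3f, test: %.3f'
          % (rf.score(X_train_res[feats], y_train_res), rf.score(X_test[feats], y_test)))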

3.4 Phase 3: Methodology of Logistic Regression

The following also corresponds to our second attempt Python file (SecondAttemptBigDataProject.ipynb). Finally, the last model we included, in cell 993,
was a Logistic Regression model. First, we ran the model using all of our predictive
variables, then we ran the model using the top four attributes gained from our earlier
feature selection. We employed the following: penalty = 'l2', C = 1.0, and random_state = 0, as arguments passed to our Logistic Regression classifier (lr_classifier). In this model, penalty selects the type of regularization (which affects the sparsity of the model), C sets the inverse of the regularization strength, and random_state fixes the randomness for reproducibility [8]. We then used this classifier as a prediction tool by
inputting contingencies via the training model, using the arguments
sub_update_x_train and update_y_train. The testing of our Logistic Regression model was actuated by running sub_predict_y_lr = lr_classifier.predict(sub_x_test). We used cross-validation to
check that our model was not overfitting, just as we did in our previous models. We
also used the f-score value for this model to determine the precision and recall [5].
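
A hedged sketch of this step follows; the notebook's variable names (sub_update_x_train, update_y_train, sub_x_test) are replaced with the illustrative names used in the earlier sketches:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

# Arguments as described above.
lr_classifier = LogisticRegression(penalty='l2', C=1.0, random_state=0)
lr_classifier.fit(X_train_res[top_four], y_train_res)

sub_predict_y_lr = lr_classifier.predict(X_test[top_four])
print('f-score: %.3f' % f1_score(y_test, sub_predict_y_lr))

# Cross-validation check for overfitting, as in the previous models.
print('CV accuracy: %.3f' % cross_val_score(lr_classifier, X[top_four], y, cv=6).mean())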

4 Results
Our results and research methods are grounded upon several overarching assumptions and enhancements, as follows. Decision Tree modeling assumes non-linearity in form,
is fairly robust to outliers, considers all possible outcomes, and can be especially
useful in the analysis of data containing non-linear features [6]. Logistic Regression
modeling holds the potential to be: comparatively straightforward, regularized to avoid
overfitting, implemented in Python, and easily updated with new data using stochastic
gradient descent, when working with a linear model [7]. Random Forests are more
encompassing and efficient in that they combine predictions from individual trees by
actuating an average from several trees during training [8].
4.1 Phase 1: Decision Tree Results

Without sending any arguments into our Decision Tree, we found that our initial accuracy was 98.669%, but when we used cross-validation, the value measured 62.206%, which led us to believe we had overfitting, because of the gap between the accuracy and the cross-validation scores. Cross-validation evaluates a prediction model by partitioning the data into training and testing sets.
So, our goal is to get a higher cross-validation score and a smaller gap between the
accuracy and cross-validation scores. To accomplish this, we tweaked the parameters of our model; even as our accuracy level went down, the rising cross-validation score increased the validity of our model. We learned that our current value
of max leaf nodes was 407. We decided to prune our tree and set our max leaf nodes to
100, per the suggestion of outside research. This led to a reduced accuracy score of
91.348% and an increased cross-validation score of 66.863%. This is a positive change
because we received a higher cross-validation, closing the gap between the accuracy
and cross-validation values. Next, we tweaked some more to see if we could get an even better result, so we pruned our tree down to 10 leaves. This led to an even lower accuracy score of 79.534% (still relatively accurate) and a higher cross-validation score of 68.855%, which was closer, but to continue improving the model we tried reducing the max_depth of the tree. This resulted in a slightly smaller accuracy
score of 78.536% and a slightly higher cross-validation score of 69.196%. These
readings were not an indication of significant progress but, nonetheless, an
improvement. At this point, we decided to find out which, if any, variables were more
important than the others, so we ran a feature importance check and learned our
predictor variables sex and years married have no effect on our model’s
outcome. Next, we ran the model minus these two variables, and this resulted in an
accuracy score of 78.536% and cross-validation score of 70.363%. This did not help
our accuracy, but it did help our cross-validation score. We tried a few other
combinations of tweaking but found that the combination of max leaf nodes set at 10
and predictor variables set at age, religious, and rate had the best yield of accuracy at
78.369% and a cross-validation score of 73.843%. We then ran a Logistic Regression analysis; even though it was not mentioned in our initial proposal, we wanted to experiment and expand our model. Our findings output an accuracy rate of 76.040% and a cross-validation value of 71.680%.
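
The pruning experiments above can be reproduced along these lines, reusing the classification_model function sketched in Section 3.2 (the max_depth trial is omitted because the report does not state the value used):

from sklearn.tree import DecisionTreeClassifier

# Progressively pruned trees, with all predictors included.
for params in [{}, {'max_leaf_nodes': 100}, {'max_leaf_nodes': 10}]:
    classification_model(DecisionTreeClassifier(random_state=0, **params),
                         data, predictors, 'affair')

# Best-yield combination: 10 leaves with the reduced predictor set.
classification_model(DecisionTreeClassifier(random_state=0, max_leaf_nodes=10),
                     data, ['age', 'religious', 'rate'], 'affair')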
In our initial attempt, we not only discovered that we could predict whether
or not an affair would occur, but we were also able to determine what variables would
weigh in more heavily on the outcome. This completes the promised components of our proposal; from here onward, we experimented and continued to improve our model in the vein of learning more about Data Science, in particular machine learning.
4.2 Phase 2: Random Forest Results

This second phase coincides with phase 2 of our methodology, which was a
continuation and progression of our original proposal. Now that we had an established
model with the employment of machine learning techniques, we decided to try out
Random Forest modeling. We found Random Forest to be interesting as it takes many
Decision Trees into its decision-making process and results in the average of all of the
Decision Trees.
As stated earlier, we determined feature importances and split the data into
training and testing before running our Random Forest model. During our first
attempt, with the inclusion of all variables, the model yielded a training accuracy of 97.917% and a testing accuracy of 71.074%. This indicates overfitting, because of the wide gap between these two accuracy scores. We also ran the f-score, which produces
a weighted average between precision and recall [5]. Next, we ran the model with the
top four variables, and this resulted in a train accuracy of 92.778% and a test accuracy
of 79.339%, a relatively high accuracy score for us. Then, we sought to extract a cross-
validation value from our model to check for overfitting, using 6-fold cross-validation,
which resulted in a 79% accuracy level. This suggests an acceptable level of overfitting, because the cross-validation score and the testing accuracy score are roughly equivalent.
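
This check can be expressed as follows, reusing X, y, and top_four from the earlier sketches:

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# 6-fold cross-validation on the top-four feature set.
scores = cross_val_score(RandomForestClassifier(random_state=0), X[top_four], y, cv=6)
print('6-fold CV accuracy: %.1f%%' % (scores.mean() * 100))  # report: roughly 79%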

4.3 Phase 3: Logistic Regression Results

As previously stated, this phase is a continuation and progression of our experiment and corresponds to phase 3 of our methodology. We ran our Logistic Regression model with the inclusion of all variables, sending in the arguments penalty = 'l2', C = 1.0, and random_state = 0, which resulted in a training accuracy of 67.361%, a testing accuracy of 65.289%, and an f-score of 65.289%. Next, we chose
to run the same model only using the top four predictors, which resulted in a training
accuracy of 63.75%, a testing accuracy of 64.462%, and an f-score of 64.462%. Due to
our scores being very close together, we do not have a significant amount of
overfitting. Lastly, we conducted a 6-fold cross-validation check, which resulted in an average cross-validation accuracy score of 53%. For this data, cross-validation proved the more informative way to evaluate the Logistic Regression model.
4.4 Interpretation & Conclusion

As stated, we took a scientific approach to this research study, and therefore, our
models were able to expand and progress over the course of our project. Based on the
results from this research project, we can conclude that the specific characteristics
intrinsic to this dataset can be used to predict whether or not a statistically vulnerable
individual (SVI) will partake in extramarital activity. Our categorical variable analysis was intended as a preliminary reference for the findings in our study discovered through Decision Tree analysis. Rate seemingly has an impact on SVIs, in that when the rate of happiness is 1, the probability of infidelity is 50%, compared to a rate of 5 and a corresponding probability of 14%. Age is seemingly also a factor, in that marriages involving younger individuals appear correlated with a higher rate of infidelity. Infidelity is also 50% more likely for families in which children are present. Rate of happiness is the variable with the most pull; it indicates that the higher the rate of happiness, the less likely an affair is to occur. In our Decision Tree model, we combined our variables into a set of predictors and compared them to our outcome variable. Our initial model produced a 98.6%
accuracy rate with a cross-validation score of 62.2%, which answered our initial
questions and substantiated our research. We continued to improve this research study
by instituting a system that balances the accuracy and cross-validation scores. We
further progressed our model through a loosely scientific approach and
employed techniques such as Logistic Regression and Random Forest modeling.
In conclusion, we found that we had the most success with the Random
Forest model using our top four predictor variables. This model held the least variation among our training score, testing score, and f-score, indicating no significant overfitting. To illustrate this more clearly, we have
appended three charts for comparison of each of our models and their output (Figure
2A-C).

5 Contributions
The adoption of team roles initially left us conflicted as to which specific roles to take on, since we all equally wanted to gain experience from this project. Consequently, we ventured to contribute equitably to different parts of the project, with each of us also having an individual emphasis on specific tasks. Sheila had an emphasis in the
discovery phase of the project and obtaining and actualizing a dataset, as well as
designing the source files. She emphasized developing our corresponding algorithms
and application methods. Yash worked on the data set, adding the affairs column, the
data dictionary, the pros and cons of Decision Tree modeling, and proofreading the
report. Alwin prioritized the documentation of this report and established a working
articulation of the source files into relatable content. Alwin recorded the processing of
our source files, while acquiring corroborating research documentation, editing, and
organizing our information efficiently and thoroughly. Ultimately, we aspired to
balance the workload, while sharing responsibilities and developing our machine
learning skills.

References
1. Behar M (2017) Personality and Sexual Predictors of Infidelity in Marital Relationships. https://search.proquest.com/openview/0a1c54b9573ba15a7f854caef4c2a5e9/1?pq-origsite=gscholar&cbl=18750&diss=y. Accessed 5 Apr 2018

2. Bessey D (2015) Love Actually? Dissecting the Marriage-Happiness Relationship. Asian Economic Journal 29:21–39. doi: 10.1111/asej.12045

3. Fair R (1977) A note on the computation of the tobit estimator. Econometrica 45:1723–1727. http://fairmodel.econ.yale.edu/rayfair/pdf/1978A200.PDF. Accessed 4 Apr 2018

4. Fair R (1978) A Theory of Extramarital Affairs. Journal of Political Economy 86:45–61.

5. Guns R, Lioma C, Larsen B (2012) The tipping point: F-score as a function of the number of retrieved items. Information Processing & Management 48:1171–1180. doi: 10.1016/j.ipm.2012.02.009

6. Hamel G (2017) Advantages & Disadvantages of Decision Trees. In: Techwalla. https://www.techwalla.com/articles/advantages-disadvantages-of-decision-trees. Accessed 22 Apr 2018

7. Logistic Regression vs Decision Trees vs SVM: Part II. In: Edvancer. https://www.edvancer.in/logistic-regression-vs-decision-trees-vs-svm-part2/

8. Scikit-learn (2017) sklearn.linear_model.LogisticRegression. In: scikit-learn 0.19.1 documentation. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html. Accessed 22 Apr 2018

9. Stein J, Song S, Coady E (2005) Is There a HITCH? TIME 165: Electronic Access. http://search.ebscohost.com.ezp.twu.edu/login.aspx?direct=true&db=a9h&AN=15599381&site=ehost-live. Accessed 12 Apr 2018
Appendix
Table 1. Data Dictionary [4]. URL: http://koaning.io/fun-datasets.html. The corresponding legend is as follows.
* Originally a string, converted to integer for categorical data
** Affair column not included in original data - created based on the data in the nbaffairs column

Variable Name | Label | Measurement Type | Data Description
sex* | Sex (gender) | Nominal integer | 1 = Male; 0 = Female
age | Age in years | Ordinal float | 17.5 = Under 20 years; 22.0 = 20-24 years; 27.0 = 25-29 years; 32.0 = 30-34 years; 37.0 = 35-39 years; 42.0 = 40-44 years; 47.0 = 45-49 years; 52.0 = 50-54 years; 57.0 = 55+ years
ym | Number of years married | Ordinal float | 0.125 = 3 months or less; 0.417 = 4-6 months; 0.75 = 6 months - 1 year; 1.5 = 1-2 years; 4 = 3-5 years; 7 = 6-8 years; 10 = 9-11 years; 15 = 12+ years
child* | Did they have children? | Nominal integer | 1 = Yes; 0 = No
religious | Level of religious beliefs | Ordinal integer | 1 = Anti; 2 = Not at all; 3 = Slightly; 4 = Somewhat; 5 = Very
education | Years of education | Ordinal integer | 9 = Grade School; 12 = High School; 14 = Some College; 16 = College Graduate; 17 = Some Graduate School; 18 = Master's Degree; 20 = Advanced Degree (Ph.D., MD, other)
occupation | Type of occupation | Ordinal integer | 1 = Student; 2 = Semi-Skilled/Unskilled; 3 = White Collar; 4 = Skilled Worker; 5 = Administrative; 6 = Degree; 7 = -
rate | Rating of happiness in marriage | Ordinal integer | 1 = Very Poor; 2 = Poor; 3 = Fair; 4 = Good; 5 = Very Good
nbaffairs | Number of affairs after marriage | Ordinal integer | 0 = none; 1 = 1 affair; 2 = 2 affairs; 3 = 3 affairs; 7 = 4-10 affairs; 12 = 10+ affairs
affair** | Affair or not | Nominal Boolean | 0 = No Affair; 1 = Affair
Figure 1. [Source file BigDataProject.ipynb, cell 55]. We created this visualization to help us understand what happens in the accompanying code. The data is divided into four groupings: train predictors, train outcome, test predictors, and test outcome. Group 1 takes the predictor columns from the rows designated as training data and places them into train_predictors using iloc. Group 2 takes the outcome column (affair) from the rows designated as training data and places it into train_target, via train_target = data[outcome].iloc[train]. Groups 1 and 2 are then handed to the learning step, model.fit(train_predictors, train_target). Groups 3 and 4 hold the test predictors and test outcomes, respectively, and are passed into model.score to evaluate the model.
Figure 2A. [Source file BigDataProject.ipynb, cells 54 to 65] A comparison of the three models studied in this project: Decision Tree, Random Forest, and Logistic Regression. The scale in this chart has been altered to illustrate the magnitude of change more clearly.
Figure 2B. [Source file SecondAttemptBigDataProject.ipynb, cells 987 to 992] A comparison of the three models studied in this project: Decision Tree, Random Forest, and Logistic Regression. The scale in this chart has been altered to illustrate the magnitude of change more clearly.
Figure 2C. [Source file SecondAttemptBigDataProject.ipynb, cells 993 to 995] A comparison of the three models studied in this project: Decision Tree, Random Forest, and Logistic Regression. The scale in this chart has been altered to illustrate the magnitude of change more clearly.
