
Books:

• Business Theory: Provost, F., & Fawcett, T. (2013). Data Science for Business: What You Need to Know
About Data Mining and Data-Analytic Thinking. O'Reilly Media, Inc.
• Technical Application: Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to Data Mining.
Pearson Addison Wesley.

Data-Analytic Thinking
Data-analytic Thinking: When faced with a business problem, you should be able to assess
whether and how data can improve performance.

Big data: Datasets that are too large for traditional data processing systems, and therefore
require new processing technologies.
Data mining: The extraction of knowledge from data, via technologies that incorporate these
principles.
▪ Also known as Knowledge Discovery in Databases (KDD).
▪ The goal is to discover patterns and insights within a database (a set of data) and to use this knowledge to make decisions.

▪ “…the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable
patterns in data.”
▪ Purposes: Classification/Prediction, Regression, Clustering, Association.
▪ Benefit: Can handle complex patterns in high-dimensional data that the human brain cannot.
▪ Example: Credit Scoring

Big data technologies: Tools that are specially designed to handle, process, and harness huge amounts
of data.
▪ Big data technology ≠ Parallel computing
▪ Used for both processes: data processing and data mining.
Data science: Set of fundamental principles that guide the extraction of knowledge from data.
Business Problem and Data Science Solution
Describing Phenomena: Clustering, Association Analysis
Predicting the Future: Prediction, Classification, Regression, Prescription
CRISP-DM (Cross-Industry Standard Process for Data Mining): a standard process model for data mining.

Descriptive Analytics
Types of Quantitative Statistical Methods
Descriptive statistics: Summary statistics that quantitatively describe or summarize features of a
collection of information.
Descriptive statistics is solely concerned with properties of the observed data, and it does not rest on the
assumption that the data come from a larger population.
Inferential statistics: Infers properties of a population, for example by testing hypotheses and
deriving estimates from an observed data sample.
Basic Summary Statistics
1. Location or central tendency → Arithmetic Mean, Median, Mode
2. Spread or data dispersion → Standard deviation, variance, range, interquartile range
3. Shape → Skewness or kurtosis
4. Correlation → Dependence between paired variables
a. Pearson correlation coefficient
Assesses linear relationships.
b. Spearman's rank correlation coefficient
Assesses monotonic relationships (whether linear or not).
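A minimal sketch of these summary statistics and the two correlation coefficients in Python, using pandas and SciPy on made-up paired data:

```python
import pandas as pd
from scipy import stats

# Hypothetical paired observations, e.g., hours studied vs. exam score
x = pd.Series([2, 3, 5, 5, 7, 9, 10, 12])
y = pd.Series([50, 55, 60, 62, 70, 75, 80, 95])

# 1. Location / central tendency
print("mean:", x.mean(), "median:", x.median(), "mode:", x.mode().tolist())

# 2. Spread / dispersion
print("std:", x.std(), "var:", x.var(), "range:", x.max() - x.min(),
      "IQR:", x.quantile(0.75) - x.quantile(0.25))

# 3. Shape
print("skewness:", x.skew(), "kurtosis:", x.kurt())

# 4. Correlation between the paired variables
pearson_r, _ = stats.pearsonr(x, y)    # linear relationship
spearman_r, _ = stats.spearmanr(x, y)  # monotonic relationship
print("Pearson r:", pearson_r, "Spearman rho:", spearman_r)
```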

Principal Component Analysis (PCA): A correlation-based analysis that derives linearly
uncorrelated variables called principal components.
▪ Often used to visualize high-dimensional datasets.
▪ Very useful for analysing high-dimensional data because it extracts the most important aspects to learn
from.
▪ Can reduce data dimensionality while keeping the information loss to a minimum.
PCA for High Dimensional Data Visualization
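A minimal sketch of PCA for 2-D visualization, assuming scikit-learn and its bundled Iris dataset (4 features projected onto 2 principal components):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# Load a small 4-dimensional dataset and standardize it (PCA is scale-sensitive)
X, y = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

# Project onto the first two principal components for a 2-D visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```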
Data Visualization
Exploratory Data Analysis (EDA): an approach to analysing datasets to summarize their main
characteristics, often with visual methods.
▪ EDA is for seeing what the data can tell us beyond the formal modelling or hypothesis-testing task.
Therefore, visualization is also very important in the initial phase of data science.
▪ Types of Popular Viz for EDA: Boxplot, Histogram, Scatter plot, Cross tab, Cross feature scatter plot.
Important Visual Component: Size, Colour, Shape, Length, Direction, Map, Time (animated)
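A minimal sketch of a few of these EDA plots with pandas and matplotlib, on a made-up dataset:

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical dataset: two numeric features and one categorical feature
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": rng.normal(35, 10, 200),
    "income": rng.normal(50_000, 15_000, 200),
    "segment": rng.choice(["A", "B"], 200),
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].boxplot(df["age"])                      # boxplot: spread and outliers
axes[1].hist(df["income"], bins=20)             # histogram: distribution shape
axes[2].scatter(df["age"], df["income"], s=5)   # scatter plot: pairwise relationship
plt.show()

# Cross tab: counts of a categorical feature against a binned numeric one
print(pd.crosstab(df["segment"], pd.cut(df["age"], bins=3)))
```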
Clustering
Cluster analysis: Divides data into groups that are meaningful, useful, or both. Based only on
information found in the data that describes the objects and their relationships.
▪ The goal: the objects within a group be similar (or related) to one another and different from (or unrelated
to) the objects in other groups.
▪ Uses data mining techniques to automatically find classes; mostly unsupervised.
▪ Classes: Conceptually meaningful groups of objects that share common characteristics.
▪ Clusters: Potential classes.
▪ An entire collection of clusters is commonly referred to as a clustering.

Different Types of Clustering


1. Hierarchical (nested) vs. Partitional (un-nested)

▪ Partitional clustering: A division of the set of data objects into non-overlapping subsets (clusters) such
that each data object is in exactly one subset → mutually exclusive.
▪ Hierarchical clustering: Permits clusters to have sub-clusters (nested).
2. Exclusive vs. Overlapping vs. Fuzzy
▪ Exclusive: Each object belongs to a single cluster.

▪ Overlapping: An object can simultaneously belong to more than one group (class).
▪ Fuzzy: Every object belongs to every cluster with a membership weight between 0 (absolutely
does not belong) and 1 (absolutely belongs). Probabilistic clustering techniques compute the probability
with which each point belongs to each cluster, and these probabilities must also sum to 1.

3. Complete vs. Partial


▪ Complete clustering: assigns every object to a cluster.
▪ Partial clustering: not every object is assigned to a cluster → some objects in the data set may represent
noise or outliers.
Simple Clustering Techniques
1. K-means → Centroid or Medoid-based.
▪ K-means uses a centroid: the mean of a group of points, which almost never corresponds to an actual data
point.

▪ K-medoids uses a medoid: the most representative actual data point in a group of points.
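A minimal K-means sketch using scikit-learn on made-up 2-D points (k is chosen by hand):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [8.0, 8.2], [8.1, 7.9], [7.9, 8.0]])

# k is chosen by the analyst (here k = 2)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print("labels:   ", kmeans.labels_)           # cluster assignment per point
print("centroids:", kmeans.cluster_centers_)  # means of each group, not actual data points
```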
2. Agglomerative Hierarchical Clustering → distance-based hierarchical bottom-up clustering.
▪ Hierarchical or nested clustering: Clusters have sub-clusters.
▪ Start with the points as individual clusters and, at each step, merge the closest pair of clusters. This
requires defining a notion of cluster proximity (nearness/distance).
▪ 3 common proximity definitions: MIN (single link), MAX (complete link), and group average (AVG).
▪ Commonly illustrated by a dendrogram and a nested cluster diagram.
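A minimal agglomerative clustering sketch using SciPy, where the linkage method selects the proximity definition and the result is shown as a dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
import matplotlib.pyplot as plt

# Hypothetical 2-D points
X = np.array([[1, 1], [1.5, 1], [5, 5], [5.5, 5.2], [9, 1], [9.2, 1.1]])

# Agglomerative (bottom-up) merging; the 'method' argument selects the proximity definition:
# 'single' ~ MIN, 'complete' ~ MAX, 'average' ~ AVG
Z = linkage(X, method="average")

dendrogram(Z)   # visualize the nested (hierarchical) merge structure
plt.show()

print(fcluster(Z, t=3, criterion="maxclust"))  # cut the tree into 3 flat clusters
```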
3. DBSCAN → density-based
▪ Core points: These points are in the interior of a density-based cluster. A point is a core point if the
number of points within a given neighbourhood around the point, as determined by the distance function
and a user-specified distance parameter Eps, exceeds a certain threshold MinPts, which is also a
user-specified parameter. In Figure 8.21 of Tan et al., point A is a core point for the indicated radius
(Eps) if MinPts ≤ 7.
▪ Border points: A border point is not a core point but falls within the neighbourhood of a core point. In
Figure 8.21, point B is a border point. A border point can fall within the neighbourhoods of several core
points.
▪ Noise points: A noise point is any point that is neither a core point nor a border point. In Figure 8.21,
point C is a noise point.

▪ Density-based clustering: Locates regions of high density that are separated from one another by
regions of low density.
▪ Can work on non-globular clusters (better than K-means and AHC).
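A minimal DBSCAN sketch using scikit-learn; eps corresponds to Eps, min_samples to MinPts, and the label -1 marks noise points:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical points: two dense regions plus one isolated point
X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1], [1.0, 0.9],
              [8.0, 8.0], [8.1, 8.1], [7.9, 8.0],
              [4.0, 15.0]])  # far from everything, so likely noise

# eps ~ Eps (neighbourhood radius), min_samples ~ MinPts (density threshold)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Label -1 marks noise points; other labels are cluster ids
print(db.labels_)
```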
Association Analysis: Discovers interesting relationships hidden in large data sets; the uncovered
relationships are represented in the form of association rules or sets of frequent items.
▪ Most commonly used for market basket analysis.
▪ Association analysis can also be used to analyse traditional two-dimensional table data using one-hot
encoding.
Problem Definition: The basic terminology used in association analysis.
1. Binary Representation: Represent the dataset as a fixed two-dimensional table.

▪ 1 means presence, 0 means absence.


▪ A very simplistic way, neglecting the item quantity.
▪ The number of columns/fields is determined by the number of unique items in the dataset.
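A minimal sketch of the binary representation, assuming the small hypothetical transaction list below:

```python
import pandas as pd

# Hypothetical transactions (market baskets)
transactions = [
    ["bread", "milk"],
    ["bread", "diapers", "beer"],
    ["milk", "diapers", "beer", "cola"],
]

# Binary (one-hot) representation: 1 = item present in the transaction, 0 = absent;
# quantities are deliberately ignored, and there is one column per unique item
items = sorted({item for t in transactions for item in t})
binary = pd.DataFrame([[int(i in t) for i in items] for t in transactions], columns=items)
print(binary)
```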
2. Itemset: every possible subset of all items in the dataset, including the null (empty) itemset.

▪ If we have 3 items: a, b, and c, then we have 8 item sets:


a. 1 of 0-itemsets: null
b. 3 of 1-itemsets: a, b, c
c. 3 of 2-itemsets: ab, ac, bc
d. 1 of 3-itemsets: abc
▪ The itemset counts (1, 3, 3, 1) follow Pascal's triangle.
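A minimal sketch enumerating all itemsets of 3 items with itertools, confirming the 8 subsets listed above:

```python
from itertools import combinations

items = ["a", "b", "c"]

# Enumerate every possible itemset (all subsets, including the empty one): 2^3 = 8 in total
itemsets = [set(c) for k in range(len(items) + 1) for c in combinations(items, k)]
print(len(itemsets))  # 8
print(itemsets)       # the empty set, three 1-itemsets, three 2-itemsets, one 3-itemset
```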
3. Support Count

Support count refers to the number of transactions that contain a particular itemset.
Example:
a. The 1-itemset {cheese} appears in 4 transactions, so its support count is 4.
b. The 2-itemset {umbrella, bread} appears in 3 transactions, so its support count is 3.
Support score is calculated as: support count / total number of transactions.
Example:
a. Support score for {cheese} is 4/10 = 0.4
b. Support score for {umbrella, bread} is 3/10 = 0.3
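A minimal sketch of support count and support score, assuming the hypothetical five-transaction basket data below (not the ten-transaction example above):

```python
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]

def support_count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Support score = support count / total number of transactions."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"milk", "diapers"}, transactions))  # 3
print(support({"milk", "diapers"}, transactions))        # 3/5 = 0.6
```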
4. Association Rules: An implication expression of the form X → Y , where X and Y are disjoint
item sets, i.e., X ∩ Y = ∅.
▪ The strength of an association rule can be measured in terms of its support and confidence.
▪ Confidence determines how frequently items in Y appear in transactions that contain X
▪ Consider the rule {Milk, Diapers} → {Beer}. Since the support count for {Milk, Diapers, Beer} is 2 and
the total number of transactions is 5, the rule’s support is 2/5 = 0.4.
▪ The rule’s confidence is obtained by dividing the support count for {Milk, Diapers, Beer} by the support
count for {Milk, Diapers}. Since there are 3 transactions that contain milk and diapers, the confidence
for this rule is 2/3 = 0.67.
▪ The association rule mining problem can be formally stated as follows: Definition 6.1 (Association Rule
Discovery). Given a set of transactions T, find all the rules having support ≥ minSup and confidence ≥ minConf,
where minSup and minConf are the corresponding support and confidence thresholds.
▪ However, calculating support and confidence for all possible rules is a waste of time, since even a small
dataset containing d items generates a large number of possible rules: R = 3^d − 2^(d+1) + 1 (e.g., d = 6
items already yield 602 possible rules).
▪ Hence, infrequent itemsets are pruned before rules are generated.
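A minimal sketch of rule support and confidence, assuming a hypothetical five-transaction dataset chosen so the numbers match the {Milk, Diapers} → {Beer} example above:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def rule_support(X, Y):
    # support(X -> Y) = support count of X union Y / total number of transactions
    return support_count(X | Y) / len(transactions)

def rule_confidence(X, Y):
    # confidence(X -> Y) = support count of X union Y / support count of X
    return support_count(X | Y) / support_count(X)

X, Y = {"Milk", "Diapers"}, {"Beer"}
print(rule_support(X, Y))     # 2/5 = 0.4
print(rule_confidence(X, Y))  # 2/3 = 0.67
```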
Frequent Itemset Generation
1. The first step is to generate frequent itemsets.
2. The Apriori principle is one of the simplest methods, given minSup (see the sketch below).
3. Principles:
a. If an itemset is frequent, then all of its subsets must also be frequent.
b. If an itemset is infrequent, then all its supersets are infrequent.
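A minimal, non-optimized sketch of Apriori-style frequent itemset generation on hypothetical transactions; candidates with an infrequent subset are pruned at each level, so their supersets are never counted:

```python
from itertools import combinations

transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
min_sup = 0.5  # minSup threshold on the support score

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}                              # frozenset -> support score
current = [frozenset([i]) for i in items]  # candidate 1-itemsets
k = 1
while current:
    # Keep only candidates that meet minSup at this level
    survivors = {c: support(c) for c in current if support(c) >= min_sup}
    frequent.update(survivors)
    keys = list(survivors)
    # Generate (k+1)-itemset candidates from surviving k-itemsets, and prune any
    # candidate that has an infrequent k-subset (the Apriori principle)
    current = [c for c in {a | b for a in keys for b in keys if len(a | b) == k + 1}
               if all(frozenset(s) in survivors for s in combinations(c, k))]
    k += 1

for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```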

Rule Generation → Confidence-based Pruning


If a rule X → Y − X does not satisfy the confidence threshold, then any rule X′ → Y − X′, where X′ is
a subset of X, must not satisfy the confidence threshold as well.
Example: X = {a, b, c}; Y = {a, b, c, d}
1. The rule has the form X → Y − X
2. Rule1: {a, b, c} → {a, b, c, d} − {a, b, c}
3. Rule1: {a, b, c} → {d}: calculate the confidence for this rule. If conf(Rule1) < minConf, then
reject Rule1.
4. If we reject Rule1, we should also reject Rule2: {a, b} → {a, b, c, d} − {a, b} = {a, b} → {c, d},
because {a, b} is a subset of {a, b, c}.
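A minimal sketch of this confidence-based pruning on hypothetical transactions over the items a–d; both rules share the same numerator (the support count of {a, b, c, d}), and support({a, b}) ≥ support({a, b, c}), so Rule2's confidence can never exceed Rule1's:

```python
# Hypothetical transactions over the items a, b, c, d
transactions = [
    {"a", "b", "c", "d"},
    {"a", "b", "c"},
    {"a", "b", "d"},
    {"a", "c", "d"},
    {"b", "c", "d"},
]
min_conf = 0.6

def support_count(itemset):
    return sum(1 for t in transactions if itemset <= t)

def confidence(X, Y_minus_X):
    # conf(X -> Y - X) = support count of the full itemset Y / support count of X
    return support_count(X | Y_minus_X) / support_count(X)

Y = {"a", "b", "c", "d"}
rule1 = confidence({"a", "b", "c"}, Y - {"a", "b", "c"})  # Rule1: {a,b,c} -> {d}
rule2 = confidence({"a", "b"}, Y - {"a", "b"})            # Rule2: {a,b}   -> {c,d}

print(rule1, rule1 >= min_conf)  # 1/2 = 0.5 -> Rule1 rejected
print(rule2, rule2 >= min_conf)  # rule2 <= rule1, so Rule2 is rejected without re-checking
```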
