
Clustering: why two steps (class 10)


Data mining: exploration and analysis of large quantities of data in order to discover
valid, novel, potentially useful, and ultimately understandable patterns in data.
The primary objective of cluster analysis is to define the structure of the data by placing
the most similar observations into groups.
Functional Reasons: clustering is an exploratory tool designed to reveal natural groupings
within a data set that would otherwise not be apparent.
- Ability to create clusters based on both categorical and continuous variables
- Automatic selection of the number of clusters
- The ability to analyze large data files efficiently
Technical Reasons: a rough estimate of the complexity

Step 1: SOM algorithm (Self-Organizing Maps)
1) Randomize the map's nodes' weight vectors
2) Traverse each input vector in the input data set
a. Traverse each node in the map
i. Use the Euclidean distance formula to find the similarity between the input
vector and the node's weight vector
ii. Track the node that produces the smallest distance (this node is the best
matching unit, BMU)
b. Update the nodes in the neighborhood of the BMU (including the BMU itself) by
pulling them closer to the input vector
i. Wv(s + 1) = Wv(s) + Θ(u, v, s) α(s)(D(t) - Wv(s))
3) Increase s and repeat from step 2 while s < λ, the chosen iteration limit
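
The loop above can be sketched in plain Python. The 1-D grid of four nodes, the toy data, the decaying learning rate α(s), and the step-function neighborhood Θ(u, v, s) are illustrative assumptions, not values fixed in the notes.

```python
import math
import random

# Minimal SOM sketch following steps 1-3 above (1-D grid, toy 2-D data).
random.seed(0)
data = [[0.1, 0.2], [0.15, 0.25], [0.9, 0.8], [0.85, 0.9]]
grid = 4  # number of nodes, laid out on a 1-D grid

# Step 1: randomize the nodes' weight vectors.
weights = [[random.random(), random.random()] for _ in range(grid)]

def dist(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

limit = 50
for s in range(limit):                    # step 3: repeat while s < limit
    alpha = 0.5 * (1 - s / limit)         # decaying learning rate alpha(s)
    radius = 2 if s < limit // 2 else 1   # shrinking neighborhood radius
    for x in data:                        # step 2: traverse the input vectors
        # Step 2a: the node with the smallest distance is the BMU.
        bmu = min(range(grid), key=lambda v: dist(x, weights[v]))
        # Step 2b: pull the BMU and its grid neighbors toward the input.
        for v in range(grid):
            theta = 1.0 if abs(v - bmu) <= radius else 0.0
            weights[v] = [w + theta * alpha * (xi - w)
                          for w, xi in zip(weights[v], x)]
```

After training, inputs from the two well-separated toy clusters map to different best-matching units.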

Step 2: K-means
1. Place K points into the space represented by the objects that are being clustered.
These points represent initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a
separation of the objects into groups from which the metric to be minimized can be
calculated.
5. Repeat Steps 1, 2, 3 and 4 a given number of times
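
Steps 1-4 can be sketched in plain Python; step 5 would wrap this in a loop over several random initializations and keep the best result. The toy points and K = 2 are illustrative choices.

```python
import math
import random

random.seed(1)
points = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (7.8, 8.3)]
K = 2

# Step 1: place K initial centroids (here, drawn from the data itself).
centroids = random.sample(points, K)

for _ in range(100):
    # Step 2: assign each object to the group with the closest centroid.
    groups = [[] for _ in range(K)]
    for p in points:
        j = min(range(K), key=lambda i: math.dist(p, centroids[i]))
        groups[j].append(p)
    # Step 3: recalculate each centroid as the mean of its group.
    new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[i]
           for i, g in enumerate(groups)]
    # Step 4: stop once the centroids no longer move.
    if new == centroids:
        break
    centroids = new
```

On this toy data the loop converges to the two cluster means, (1.1, 0.9) and (7.9, 8.15).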

In K-means the nodes (centroids) are independent of each other: the winning node
gets the chance to adapt itself, and only that node moves. In SOM the nodes (centroids) are
placed on a grid, so each node is considered to have neighbors, the nodes
adjacent or near to it with respect to their position on the grid. The winning node
therefore not only adapts itself but also causes a change in its neighbors. K-means can be
considered a special case of SOM where no neighbors are taken into account when
modifying the centroid vectors.
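
That relationship can be stated directly through the neighborhood term Θ(u, v, s) in the SOM update rule above; the function names below are illustrative:

```python
# In a SOM, the winning node (BMU) u pulls along every node v within some
# grid radius; in K-means, only the winner itself moves.
def theta_som(u, v, radius):
    return 1.0 if abs(u - v) <= radius else 0.0

def theta_kmeans(u, v):
    # K-means as the special case: the neighborhood is the BMU alone,
    # i.e. a SOM neighborhood of radius 0.
    return 1.0 if u == v else 0.0
```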



RFM (class 13)
Recency: When did the customer make their last purchase?
Frequency: How often does the customer make a purchase?
Monetary: How much money does the customer spend?

- RFM Analysis helps companies decide which customers should receive select offers and
promotional items.
- It is a way for companies to find ways to increase customer spending.
- Companies can use it to target lost customers and give them incentives to purchase
items.
- RFM Analysis can help companies keep track of their customers and build a
relationship that can increase sales and productivity.
- It also identifies minimal-loss customers: those who spend low dollar amounts in small
quantities.

Break-Even Response Rate = Unit Cost / Unit Profit
Only a small percentage (such as 5%) of customers respond to a typical offer;
95% or more will not respond at all.
RFM tells you which customers are most likely to be in the responsive 5%.
Note that those who respond may not be your most profitable customers.
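
A worked example of the break-even formula above; the $2 contact cost and $40 per-response profit are hypothetical numbers, not figures from the notes.

```python
unit_cost = 2.00     # cost to contact one customer (e.g. one mail piece)
unit_profit = 40.00  # profit earned from one responding customer

# A campaign breaks even when the response rate covers the contact cost.
break_even_rate = unit_cost / unit_profit
print(break_even_rate)  # 0.05: an RFM cell must respond at 5%+ to be worth mailing
```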

Recommender Systems (class 16)
Value for the customer
- Find things that are interesting
- Narrow down the set of choices
- Help explore the space of options
- Discover new things
- Entertainment
Value for the provider
- Additional, and probably unique, personalized service for the customer
- Increase trust and customer loyalty
- Increase sales, click-through rates, conversion, etc.
- Opportunities for promotion and persuasion
- Obtain more knowledge about the customer

Techniques
- Collaborative: Recommend items based on ratings provided by different users. The
recommender identifies a user's neighbors and generates recommendations based on
this neighborhood.
- Content-based: Recommend items based on two information sources: features of
products and ratings given by users. A classifier is learned to determine whether an item
should be recommended to a user or not.
- Demographic: Recommend items based on demographic information about the user.
- Knowledge-based: Recommend items based on a user's preferences and an item's
features. Domain knowledge can be used to match user preferences to item features.
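
A toy sketch of the collaborative technique: pick the target user's nearest neighbor by cosine similarity over their ratings, then recommend items the neighbor rated that the target has not seen. The ratings matrix and user names are made up for illustration.

```python
import math

ratings = {
    "ana":   {"item1": 5, "item2": 4, "item3": 1},
    "bob":   {"item1": 5, "item2": 5, "item4": 4},
    "carol": {"item3": 5, "item4": 2},
}

def cosine(u, v):
    """Cosine similarity between two sparse rating vectors (dicts)."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    num = sum(u[i] * v[i] for i in common)
    den = (math.sqrt(sum(x * x for x in u.values())) *
           math.sqrt(sum(x * x for x in v.values())))
    return num / den

def recommend(user):
    """Items the nearest neighbor rated that `user` has not, best first."""
    others = [u for u in ratings if u != user]
    neighbor = max(others, key=lambda u: cosine(ratings[user], ratings[u]))
    unseen = {i: r for i, r in ratings[neighbor].items()
              if i not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)

print(recommend("ana"))  # ['item4']: bob is ana's nearest neighbor
```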

Decision trees (class 14)
A decision tree builds classification or regression models in the form of a tree structure.
It breaks a dataset down into smaller and smaller subsets while at the same time an
associated decision tree is incrementally developed. The final result is a tree
with decision nodes and leaf nodes. A decision node (e.g., Outlook) has two or more
branches (e.g., Sunny, Overcast, and Rainy). A leaf node (e.g., Play) represents a
classification or decision.

A decision tree is built top-down from a root node and involves partitioning the data
into subsets that contain instances with similar values (homogeneous). The ID3 algorithm
uses entropy to calculate the homogeneity of a sample. If the sample is completely
homogeneous the entropy is zero, and if the sample is equally divided it has an
entropy of one.
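
The entropy measure described above can be sketched in a few lines; the 9-yes/5-no sample is the classic play-tennis example commonly used with ID3.

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(-p * math.log2(p) for p in probs)

print(entropy(["yes", "yes", "yes"]))               # 0.0  (completely homogeneous)
print(entropy(["yes", "no"]))                       # 1.0  (equally divided)
print(round(entropy(["yes"] * 9 + ["no"] * 5), 3))  # 0.94 (mixed sample)
```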









Overfitting
Overfitting is a significant practical difficulty for decision tree models and many other
predictive models. Overfitting happens when the learning algorithm continues to
develop hypotheses that reduce training set error at the cost of an
increased test set error. There are several approaches to avoiding overfitting in
building decision trees.
- Pre-pruning: stop growing the tree earlier, before it perfectly classifies the
training set.
- Post-pruning: allow the tree to perfectly classify the training set, and then
prune it back.

In practice, the second approach (post-pruning overfit trees) is more successful,
because it is hard to estimate precisely when to stop growing the tree.

The important step of tree pruning is to define a criterion to be used to determine the
correct final tree size, using one of the following methods:
1) Use a dataset distinct from the training set (called a validation set) to evaluate the
effect of post-pruning nodes from the tree.
2) Build the tree using the training set, then apply a statistical test to estimate
whether pruning or expanding a particular node is likely to produce an
improvement beyond the training set:
a. error estimation
b. significance testing
3) Minimum Description Length principle: use an explicit measure of the complexity
of encoding the training set and the decision tree, stopping growth of the tree
when this encoding size, size(tree) + size(misclassifications(tree)), is minimized.

Market Segmentation
The process of defining and subdividing a large homogenous market into clearly
identifiable segments having similar needs, wants, or demand characteristics.
Its objective is to design a marketing mix that precisely matches the expectations of
customers in the targeted segment
Few companies are big enough to supply the needs of an entire market; most
must break down the total demand into segments and choose those that the company
is best equipped to handle.
Four basic factors that affect market segmentation are
1) Clear identification of the segment,
2) Measurability of its effective size,
3) Its accessibility through promotional efforts, and
4) Its appropriateness to the policies and resources of the company.
The four basic market segmentation-strategies are based on
1) Behavioral: divides consumers according to their knowledge of, use of, or response to a product
2) Demographic: age, sex, generation, religion, occupation and education level
3) Psychographic: studying the activities, interests and opinions (AIO) of customers
4) Geographical differences: nations, states, regions, countries, cities, neighbourhoods
