
DATA MINING METRICS

INTRODUCTION

The simple productivity measures and hard constraints used in many paratransit
vehicle scheduling software programs do not fully capture the interests of all the stakeholders
in a typical paratransit organization (e.g., passengers, drivers, municipal government). As a
result, many paratransit agencies still retain a human scheduler to look through all of the
schedules to manually pick out impractical, unacceptable runs. (A run is considered one
vehicle's schedule for one day.) The goal of this research was to develop a systematic tool
that can compute all the relevant performance metrics of a run, predict its overall quality, and
identify bad runs automatically.

This assignment presents a methodology that includes a number of performance
metrics reflecting the key interests of the stakeholders (e.g., number of passengers per vehicle
per hour, dead-heading time, passenger wait time, passenger ride time, and degree of
zigzagging) and a data-mining tool to fit the metrics to the ratings provided by experienced
schedulers. The encouraging preliminary results suggest that the proposed methodology can
be easily extended to and implemented in other paratransit organizations to improve
efficiency by effectively detecting poor schedules. Many criteria can be used to evaluate the
performance of supervised learning. Different criteria are appropriate in different settings,
and it is not always clear which criteria to use.

A further complication is that learning methods that perform well on one criterion
may not perform well on other criteria. For example, SVMs and boosting are designed to
optimize accuracy, whereas neural nets typically optimize squared error or cross entropy. We
conducted an empirical study using a variety of learning methods (SVMs, neural nets, k-
nearest neighbor, bagged and boosted trees, and boosted stumps) to compare nine boolean
classification performance metrics: Accuracy, Lift, F-Score, Area under the ROC Curve,
Average Precision, Precision/Recall Break-Even Point, Squared Error, Cross Entropy, and
Probability Calibration. Multidimensional scaling (MDS) shows that these metrics span a low
dimensional manifold.
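As an illustration only (not part of the original study), the sketch below shows how several of these metrics could be computed in Python with scikit-learn from true labels and predicted probabilities; the data are toy values chosen purely for demonstration.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             average_precision_score, f1_score,
                             log_loss, brier_score_loss)

# Toy labels and predicted probabilities of the positive class.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.2, 0.7, 0.6, 0.4, 0.1, 0.8, 0.3])
y_pred = (y_prob >= 0.5).astype(int)   # threshold-based class predictions

print("Accuracy         :", accuracy_score(y_true, y_pred))
print("F-score          :", f1_score(y_true, y_pred))
print("ROC area         :", roc_auc_score(y_true, y_prob))
print("Average precision:", average_precision_score(y_true, y_prob))
print("Cross entropy    :", log_loss(y_true, y_prob))
print("Squared error    :", brier_score_loss(y_true, y_prob))
```

Note that accuracy and F-score depend on the chosen threshold, ROC area and average precision depend only on the ordering of the predictions, and cross entropy and squared error treat the predictions as probabilities, which is exactly the grouping described above.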

The three metrics that are appropriate when predictions are interpreted as
probabilities (squared error, cross entropy, and calibration) lie in one part of metric space, far
away from the metrics that depend on the relative order of the predicted values (ROC area,
average precision, break-even point, and lift). Between them fall the two metrics that depend on
comparing predictions to a threshold: accuracy and F-score. As expected, maximum margin
methods such as SVMs and boosted trees have excellent performance on metrics like
accuracy, but perform poorly on probability metrics such as squared error.

What was not expected was that the margin methods have excellent performance on
ordering metrics such as ROC area and average precision. We introduce a new metric, SAR,
that combines squared error, accuracy, and ROC area into one metric. MDS and correlation
analysis show that SAR is centrally located and correlates well with the other metrics,
suggesting that it is a good general purpose metric to use when more specific criteria are not
known.
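Assuming the commonly cited formulation in which SAR averages accuracy, ROC area, and one minus the root-mean-squared error of the predicted probabilities, a minimal Python sketch might look like the following (scikit-learn is assumed for the component metrics).

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

def sar(y_true, y_prob, threshold=0.5):
    """Sketch of SAR, assuming SAR = (Accuracy + ROC area + (1 - RMSE)) / 3."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    acc = accuracy_score(y_true, (y_prob >= threshold).astype(int))
    auc = roc_auc_score(y_true, y_prob)
    rmse = np.sqrt(np.mean((y_true - y_prob) ** 2))
    return (acc + auc + (1.0 - rmse)) / 3.0
```

Because each component is bounded between 0 and 1, the combined score is also bounded, and a model must do reasonably well on thresholded, ordering, and probability metrics at once to score highly.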

We investigate the use of data mining for the analysis of software metric databases,
and some of the issues in this application domain. Software metrics are collected at various
phases of the software development process, in order to monitor and control the quality of a
software product. However, software quality control is complicated by the complex
relationship between these metrics and the attributes of a software development process. Data
mining has been proposed as a potential technology for supporting and enhancing our
understanding of software metrics and their relationship to software quality.

NINE TECHNIQUES USED IN DATA MINING

Predictive analytics enables you to develop mathematical models to help you better
understand the variables driving success. Predictive analytics relies on formulas that compare
past successes and failures, and then uses those formulas to predict future outcomes.
Predictive analytics, pattern recognition, and classification problems are not new. Long used
in the financial services and insurance industries, predictive analytics is about using statistics,
data mining, and game theory to analyze current and historical facts in order to make
predictions about future events. The value of predictive analytics is obvious. The more you
understand customer behavior and motivations, the more effective your marketing will be.

1. Regression analysis. Regression models are the mainstay of predictive analytics.
The linear regression model analyzes the relationship between the response or dependent
variable and a set of independent or predictor variables. That relationship is expressed as an
equation that predicts the response variable as a linear function of the parameters.
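As a brief illustration (not taken from the assignment), the following sketch fits a linear regression with scikit-learn on made-up data relating advertising spend and store visits to sales.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical data: [advertising spend, store visits] as predictors of sales.
X = np.array([[10, 200], [15, 250], [20, 270], [25, 320], [30, 400]])
y = np.array([120, 150, 170, 200, 260])

model = LinearRegression().fit(X, y)
print("Coefficients:", model.coef_)                 # effect of each predictor
print("Intercept   :", model.intercept_)
print("Prediction  :", model.predict([[22, 300]]))  # forecast for a new case
```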

2. Choice modeling. Choice modeling is an accurate and general-purpose tool for
making probabilistic predictions about decision-making behavior. It behooves every
organization to target its marketing efforts at customers who have the highest probabilities of
purchase.

Choice models are used to identify the most important factors driving customer
choices. Typically, the choice model enables a firm to compute an individual's likelihood of
purchase, or other behavioral response, based on variables that the firm has in its database,
such as geo-demographics, past purchase behavior for similar products, attitudes, or
psychographics.
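One common way to implement such a choice model is a logit (logistic regression) model of purchase probability. The sketch below assumes scikit-learn and entirely hypothetical customer features; it is a minimal example rather than a production model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical features: [past purchases, age, email opens]; 1 = purchased.
X = np.array([[0, 25, 1], [3, 40, 5], [1, 35, 2],
              [5, 50, 8], [0, 30, 0], [4, 45, 6]])
y = np.array([0, 1, 0, 1, 0, 1])

choice_model = LogisticRegression(max_iter=1000).fit(X, y)

# Likelihood of purchase for a new customer in the database.
new_customer = np.array([[2, 38, 3]])
print("Purchase probability:", choice_model.predict_proba(new_customer)[0, 1])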

3. Rule induction. Rule induction involves developing formal rules that are extracted
from a set of observations. The rules extracted may represent a scientific model of the data or
local patterns in the data. One major rule-induction paradigm is the association rule.
Association rules are about discovering interesting relationships between variables in large
databases; in data mining they are used to uncover regularities between products. For
example, if someone buys peanut butter and jelly, he or she is likely to buy bread. The idea
behind association rules is to understand that when a customer does X, he or she will most
likely also do Y. Understanding those kinds of relationships can help with
forecasting sales, promotional pricing, or product placements.
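To make the idea concrete, the sketch below computes the standard support, confidence, and lift measures for the peanut butter and jelly rule on a toy set of transactions (the baskets are invented for illustration).

```python
# Toy market-basket data for the rule {peanut butter, jelly} -> {bread}.
transactions = [
    {"peanut butter", "jelly", "bread"},
    {"peanut butter", "jelly", "bread", "milk"},
    {"peanut butter", "bread"},
    {"jelly", "milk"},
    {"peanut butter", "jelly", "bread"},
    {"milk", "bread"},
]

antecedent = {"peanut butter", "jelly"}
consequent = {"bread"}

n = len(transactions)
both = sum(1 for t in transactions if antecedent | consequent <= t)
ante = sum(1 for t in transactions if antecedent <= t)
cons = sum(1 for t in transactions if consequent <= t)

support = both / n              # how often the full itemset appears
confidence = both / ante        # P(bread | peanut butter and jelly)
lift = confidence / (cons / n)  # how much the rule beats chance

print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift greater than 1 indicates that buying peanut butter and jelly genuinely raises the chance of buying bread, which is the kind of regularity a rule-induction system would surface.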

4. Network/Link Analysis. This is another technique for associating like records. Link
analysis is a subset of network analysis. It explores relationships and associations among
many objects of different types that are not apparent from isolated pieces of information. It is
commonly used for fraud detection and by law enforcement. You may be familiar with link
analysis, since several Web-search ranking algorithms use the technique.
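A small illustration of this idea is PageRank scoring, the approach popularized by Web-search ranking. The sketch assumes the networkx library and a made-up link graph; nodes could just as well be accounts or transactions in a fraud-detection setting.

```python
import networkx as nx

# Hypothetical link graph: edges point from one page (or account) to another.
G = nx.DiGraph([("A", "B"), ("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")])

# PageRank scores each node's importance purely from the link structure.
scores = nx.pagerank(G, alpha=0.85)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```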

5. Clustering/Ensembles. Cluster analysis, or clustering, is a way to categorize a
collection of "objects," such as survey respondents, into groups or clusters to look for
patterns. Ensemble analysis is a newer approach that leverages multiple cluster solutions (an
ensemble of potential solutions). There are various ways to cluster or create ensembles.
Regardless of the method, the purpose is generally the same: to use cluster analysis to
partition customers into segments and target markets in order to better understand and predict the
behaviors and preferences of the segments. Clustering is a valuable predictive-analytics
approach when it comes to product positioning, new-product development, usage habits,
product requirements, and selecting test markets.
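As a minimal sketch (assuming scikit-learn and invented respondent data), k-means clustering can be used to partition respondents into segments:

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical survey respondents described by [age, annual spend].
X = np.array([[22, 300], [25, 350], [47, 1200],
              [52, 1100], [46, 900], [23, 280]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Segment labels :", kmeans.labels_)           # cluster per respondent
print("Segment centers:", kmeans.cluster_centers_)  # average profile per segment
```

An ensemble approach would repeat this with different algorithms or settings and combine the resulting partitions, rather than relying on a single clustering run.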

6. Neural networks. Neural networks were designed to mimic how the brain learns
and analyzes information. Organizations develop and apply artificial neural networks to
predictive analytics in order to create a single framework.

The idea is that a neural network is much more efficient and accurate in circumstances
where complex predictive analytics is required, because neural networks comprise a series of
interconnected calculating nodes that are designed to map a set of inputs into one or more
output signals. Neural networks are ideal for deriving meaning from complicated or
imprecise data and can be used to extract patterns and detect trends that are too complex to be
noticed by humans or other computer techniques. Marketing organizations find neural
networks useful for predicting customer demand and customer segmentation.
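A hedged sketch of this idea, using scikit-learn's MLPClassifier on fabricated customer data, might look like the following; a real application would need far more data and careful tuning.

```python
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical customer features [visits, avg basket value] and a response label.
X = [[1, 200.0], [5, 50.0], [2, 180.0], [7, 20.0], [0, 220.0], [6, 35.0]]
y = [1, 0, 1, 0, 1, 0]

# Scaling the inputs first helps the interconnected nodes converge.
net = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000,
                                  random_state=0))
net.fit(X, y)
print(net.predict([[3, 150.0]]))   # predicted class for a new customer
```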

7. Memory-based reasoning (MBR)/Case-based reasoning. This technique produces results
similar to a neural network's but goes about it differently. MBR looks for "neighbor"
records rather than patterns, and it solves new problems based on the solutions of similar past
problems. MBR is an empirical classification method and operates by comparing new
unclassified records with known examples and patterns.
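The nearest-neighbor flavor of MBR can be sketched as follows, assuming scikit-learn and hypothetical, already-classified records.

```python
from sklearn.neighbors import KNeighborsClassifier

# Known, already-classified examples (hypothetical feature vectors).
X_known = [[1.0, 0.2], [0.9, 0.1], [0.2, 0.9], [0.1, 0.8]]
y_known = ["responder", "responder", "non-responder", "non-responder"]

# MBR classifies a new record by consulting its nearest neighbors.
mbr = KNeighborsClassifier(n_neighbors=3).fit(X_known, y_known)
print(mbr.predict([[0.8, 0.3]]))   # closest neighbors are responders
```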

8. Decision trees. Decision trees use recursive-partitioning algorithms to help with
classification, and the decision-tree process generates an explicit set of if/then rules that can be followed. Decision
trees are useful for helping you choose among several courses of action and enable you to
explore the possible outcomes for various options in order to assess the risk and rewards for
each potential course of action. Such an analysis is useful when you need to choose among
different strategies or investment opportunities, and especially when you have limited
resources.
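As an illustration (assuming scikit-learn and made-up applicant data), a shallow decision tree can be fit and the if/then rules it learned printed out.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical loan applicants: [income (k$), years employed] -> approve?
X = [[30, 1], [80, 10], [45, 3], [90, 12], [25, 0], [60, 7]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# export_text prints the decision rules the tree learned.
print(export_text(tree, feature_names=["income", "years_employed"]))
```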

9. Uplift modeling, also known as net-response modeling or incremental-response modeling.
This technique directly models the incremental impact of targeted marketing activities.

The uplift of a marketing campaign is usually defined as the difference in response
rates between a treated group and a randomized control group. Uplift modeling uses a
randomized scientific control to measure the effectiveness of a marketing action and to build
a model that predicts the incremental response to the marketing action.
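One common way to build such a model is the two-model ("T-learner") approach, which fits separate response models for the treated and control groups and subtracts their predictions. The sketch below assumes scikit-learn and invented campaign data; it illustrates the general idea rather than any specific commercial implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical campaign data: features, treatment flag, and observed response.
X = np.array([[25, 1], [40, 3], [35, 2], [50, 5],
              [30, 1], [45, 4], [28, 2], [52, 6]])
treated = np.array([1, 1, 1, 1, 0, 0, 0, 0])    # 1 = received the campaign
response = np.array([1, 1, 0, 1, 0, 1, 0, 0])   # 1 = responded

# Fit one response model per group.
m_treat = LogisticRegression(max_iter=1000).fit(X[treated == 1], response[treated == 1])
m_ctrl = LogisticRegression(max_iter=1000).fit(X[treated == 0], response[treated == 0])

# Uplift = predicted response if treated minus predicted response if not.
new_customers = np.array([[33, 2], [48, 5]])
uplift = (m_treat.predict_proba(new_customers)[:, 1]
          - m_ctrl.predict_proba(new_customers)[:, 1])
print("Estimated incremental response:", uplift)
```

Customers with the largest estimated uplift are the ones for whom the campaign itself makes the biggest difference, which is exactly the targeting question uplift modeling answers.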

CONCLUSION

The detection of function clones in software systems is valuable for code-adaptation
and error-checking maintenance activities. This assignment presents an efficient
metrics-based data mining clone detection approach. First, metrics are collected for all
functions in the software system.

A data mining algorithm, fractal clustering, is then utilized to partition the software
system into a relatively small number of clusters. Each of the resulting clusters encapsulates
functions that are within a specific proximity of each other in the metrics space. Finally,
clone classes, rather than pairs, are easily extracted from the resulting clusters. For large
software systems, the approach is very space efficient and scales linearly with the size of the data set.
Evaluation is performed using medium and large open source software systems. In this
evaluation, the effect of the chosen metrics on the detection precision is investigated.
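To illustrate the general shape of such an approach (using DBSCAN as a stand-in for the fractal clustering algorithm described here, and invented metric values), functions that lie close together in metrics space can be grouped into candidate clone classes:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical metric vectors per function:
# [lines of code, cyclomatic complexity, number of parameters].
function_metrics = {
    "parse_header": [40, 6, 2],
    "parse_footer": [41, 6, 2],   # near-identical metrics -> likely clone
    "init_cache":   [12, 2, 1],
    "reset_cache":  [13, 2, 1],   # another candidate clone pair
    "main_loop":    [120, 25, 4],
}

names = list(function_metrics)
X = np.array([function_metrics[n] for n in names])

# Group functions that are within a small radius of each other in metrics space.
labels = DBSCAN(eps=3.0, min_samples=2).fit_predict(X)

for cluster in set(labels) - {-1}:          # -1 marks unclustered functions
    clones = [n for n, lab in zip(names, labels) if lab == cluster]
    print("candidate clone class:", clones)
```

As in the approach described above, this yields whole clone classes rather than pairs, and the choice of metrics directly determines detection precision.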

