
Data Mining

Why Data Mining?

Credit Ratings/Target Marketing:
- Given a database of 100,000 names, which persons are the least likely to default on their credit cards?
- Identify likely responders to sales promotions.

Fraud Detection:
- Which types of transactions are likely to be fraudulent, given the demographics and transactional history of a particular customer?

Customer Relationship Management:
- Which of my customers are likely to be the most loyal, and which are most likely to leave for a competitor?

Data Mining helps to extract such information.

Data Mining

Process of semi-automatically analyzing large databases to find patterns that are:

- valid: hold on new data with some certainty
- novel: non-obvious to the system
- useful: should be possible to act on the item
- understandable: human should be able to interpret the pattern

Also known as Knowledge Discovery in Databases (KDD)

Knowledge Discovery

The ultimate goal of knowledge discovery is to find the patterns hidden in huge sets of data and to interpret them as useful knowledge and information. Knowledge discovery is a process that extracts implicit, previously unknown, and potentially useful information from databases.

Knowledge Discovery Process


Data from a variety of sources is integrated into a single data store, called the target data. The data is then pre-processed and transformed into a standard format. Data mining algorithms process the data and produce output in the form of patterns or rules. Those patterns and rules are then interpreted to yield new or useful knowledge. Data mining is thus the central step of the knowledge discovery process.
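The four stages described above (integration, pre-processing, mining, interpretation) can be sketched as a small pipeline. This is only an illustration: the records, the function names, and the trivial item-counting "algorithm" are all assumptions, not part of any particular KDD tool.

```python
from collections import Counter

# Hypothetical records from two separate sources (illustrative data only).
source_a = [{"customer": "Ann ", "item": "bread"}, {"customer": "bob", "item": "milk"}]
source_b = [{"customer": "ANN", "item": "Milk"}, {"customer": "Bob", "item": "milk"}]

def integrate(*sources):
    """Integration: merge all sources into one target data set."""
    return [record for source in sources for record in source]

def preprocess(records):
    """Pre-processing: bring every field into a standard format."""
    return [
        {"customer": r["customer"].strip().lower(), "item": r["item"].strip().lower()}
        for r in records
    ]

def mine(records):
    """Mining: a deliberately trivial 'algorithm' that counts items."""
    return Counter(r["item"] for r in records)

def interpret(patterns, min_count=2):
    """Interpretation: keep only patterns frequent enough to act on."""
    return [
        f"'{item}' was bought {count} times"
        for item, count in patterns.most_common()
        if count >= min_count
    ]

target_data = integrate(source_a, source_b)
clean_data = preprocess(target_data)
patterns = mine(clean_data)
knowledge = interpret(patterns)
print(knowledge)
```

In a real project each stage is far larger, but the data flow between the stages stays the same.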

Data Mining

Data Mining refers to developing business intelligence from data that an organization collects, organizes, and processes. Organizations use data mining techniques to gain a better understanding of their customers and their own operations.


It uses statistical, mathematical, artificial intelligence, and machine-learning techniques to extract and identify useful information and subsequent knowledge from large databases. Data mining is the process of extracting patterns from large data sets by combining methods from statistics and artificial intelligence with database management. Modern businesses increasingly see data mining as an important tool for transforming data into business intelligence, which gives them an informational advantage.


It includes tasks such as knowledge extraction, data archaeology, data exploration, data pattern processing, data dredging and information harvesting.

Characteristics & Objectives of Data Mining


- Data must be consolidated in a data warehouse.
- The data mining environment is a client/server or web-based architecture.
- Data mining tools are combined with spreadsheets and other software development tools so that data can be analyzed and processed quickly.
- Parallel processing should be used, because data mining deals with large amounts of data.

Data Mining Processes

The Cross-Industry Standard Process for Data Mining (CRISP-DM)

Business understanding
- Understand the business objectives clearly and find out what the client really wants to achieve.
- Assess the current situation: resources, assumptions, constraints, and other important factors to be considered.
- Create data mining goals that achieve the business objectives within the current situation.
- Establish a good data mining plan to achieve both the business and the data mining goals.

Data understanding
- It starts with initial data collection from the available sources, to get familiar with the data. Activities such as data loading and data integration must be carried out for the collection to succeed.
- Next, the gross or surface properties of the acquired data need to be examined carefully and reported.
- Then the data needs to be explored by tackling the data mining questions, which can be addressed using querying, reporting, and visualization.
- Finally, the data quality must be examined by answering questions such as "Is the acquired data complete?" and "Are there any missing values in the acquired data?"
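The data quality question about missing values can be answered with a simple per-field report. A minimal sketch, assuming records are held as dictionaries and that `None` marks a missing value; the field names and data are hypothetical.

```python
# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": None, "income": 48000},
    {"id": 3, "age": 29, "income": None},
    {"id": 4, "age": None, "income": 61000},
]

def missing_value_report(rows):
    """Return, per field, the fraction of rows where the value is missing."""
    fields = sorted({key for row in rows for key in row})
    report = {}
    for field in fields:
        missing = sum(1 for row in rows if row.get(field) is None)
        report[field] = missing / len(rows)
    return report

report = missing_value_report(records)
for field, share in report.items():
    print(f"{field}: {share:.0%} missing")
```

A field with a high share of missing values is a candidate for cleaning or exclusion in the data preparation phase.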

Data preparation
- Data preparation consumes about 90% of the project time.
- The outcome of the data preparation phase is the final data set.
- Once the available data sources are identified, the data needs to be selected, cleaned, constructed, and formatted into the desired form.
- Data exploration may be carried out in greater depth during this phase to notice patterns based on the business understanding.

Modeling
- Modeling techniques are selected for the prepared dataset.
- Next, a test scenario is generated to validate the model's quality and validity.
- Then one or more models are created by running the modeling tool on the prepared dataset.
- Lastly, the models are assessed carefully, involving stakeholders, to make sure that they meet the business initiatives.

Evaluation
- First, the model results must be evaluated in the context of the business objectives.
- New business requirements may arise because of new patterns discovered in the model results or because of other factors.
- Gaining business understanding is an iterative process in data mining.
- Finally, a go or no-go decision must be made before moving to the deployment phase.

Deployment
- The knowledge or information gained through the data mining process needs to be presented in such a way that stakeholders can use it when they want it.
- Depending on the business requirements, the deployment phase can be simple or complex.
- In this phase, maintenance and monitoring plans have to be created for deployment and future support.

Architecture of Data Mining


- No-Coupling
- Loose Coupling
- Semi-tight Coupling
- Tight Coupling

No-Coupling
- The data mining system does not use any functionality of a database or data warehouse system.
- It retrieves data from a particular data source such as a file system, processes the data using major data mining algorithms, and stores the results in the file system.
- No-coupling is considered a poor architecture for a data mining system; however, it is used for simple data mining processes.

Loose Coupling
- The data mining system uses a database or data warehouse for data retrieval.
- It retrieves data from the database or data warehouse, processes the data using data mining algorithms, and stores the results in those systems.
- This architecture is mainly for memory-based data mining systems that do not require high scalability and high performance.

Semi-tight Coupling
- Besides linking to a database (DB) or data warehouse (DW), the data mining system uses several features of the database or data warehouse system to perform some data mining tasks, including sorting, indexing, aggregation, etc.
- In this architecture, some intermediate results can be stored in the database or data warehouse system for better performance.
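The difference between loose and semi-tight coupling can be sketched with Python's built-in `sqlite3` module. In the loose variant the database only hands over raw rows and the aggregation happens in application memory; in the semi-tight variant the aggregation and sorting are delegated to the database. The `sales` table and its contents are assumptions made for the example.

```python
import sqlite3

# An in-memory database standing in for the DB/DW (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 100.0), ("south", 250.0), ("north", 50.0), ("south", 25.0)],
)

def totals_loose(connection):
    """Loose coupling: fetch raw rows, then aggregate in application memory."""
    totals = {}
    for region, amount in connection.execute("SELECT region, amount FROM sales"):
        totals[region] = totals.get(region, 0.0) + amount
    return totals

def totals_semi_tight(connection):
    """Semi-tight coupling: let the database do the aggregation and sorting."""
    query = "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    return dict(connection.execute(query).fetchall())

loose = totals_loose(conn)
semi_tight = totals_semi_tight(conn)
print(loose, semi_tight)
```

Both functions return the same totals; the semi-tight version moves less data out of the database, which is where the performance benefit comes from on large tables.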

Tight Coupling
- The database or data warehouse is treated as an information retrieval component of the data mining system, fully integrated with it.
- This architecture provides system scalability, high performance, and integrated information.

Tight Coupling Data Mining Architecture


There are three tiers in the tight-coupling data mining architecture:

- Data layer: a database and/or data warehouse system. This layer is the interface to all data sources. Data mining results are stored in the data layer so that they can be presented to the end user as reports or other kinds of visualization.
- Data mining application layer: retrieves data from the database. Transformation routines can be run here to bring the data into the desired format. The data is then processed using various data mining algorithms.
- Front-end layer: provides an intuitive, friendly user interface for the end user to interact with the data mining system. Data mining results are presented to the user in visual form in this layer.

How Does Data Mining Work?

Three methods are used to identify patterns in data:

- Simple Models (e.g., SQL, OLAP, Human Judgment)
- Intermediate Models (e.g., Regression, Decision Trees, Clustering)
- Complex Models (e.g., Neural Networks)

Data Mining algorithms are classified into five categories:

- Classification
- Clustering
- Association
- Prediction
- Sequence Discovery

Classification (Supervised learning)


- A classic data mining technique based on machine learning.
- Analyzes the historical data stored in a database and automatically generates a model that can predict future behaviour.
- Classifies each item in a set of data into one of a predefined set of classes or groups.
- Tools used: Neural Networks, Decision Trees, if-then-else rules without a tree structure.
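As a minimal sketch of supervised classification, the code below learns a single if-then-else rule (a one-level decision tree, often called a decision stump) from labeled historical data and uses it to classify new items. The income figures, the labels, and the function names are assumptions made for the example.

```python
# Hypothetical history: (annual income in thousands, did the customer default?)
history = [(18, True), (22, True), (27, True), (35, False), (41, False), (60, False)]

def train_stump(examples):
    """Pick the income threshold that misclassifies the fewest examples.

    The learned rule is: predict 'default' when income < threshold.
    """
    incomes = sorted(income for income, _ in examples)
    # Candidate thresholds: midpoints between consecutive incomes.
    candidates = [(a + b) / 2 for a, b in zip(incomes, incomes[1:])]

    def errors(threshold):
        return sum((income < threshold) != label for income, label in examples)

    return min(candidates, key=errors)

def predict(threshold, income):
    """Apply the learned if-then-else rule to a new customer."""
    return "default" if income < threshold else "no default"

threshold = train_stump(history)
print(threshold, predict(threshold, 25), predict(threshold, 50))
```

The model here is just one number, the threshold. Decision trees and neural networks follow the same train-then-predict pattern with richer models.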

Clustering

Objective: To partition/optimize a database into segments/clusters whose members share similar qualities within each group (max similarities) and the members across the groups have minimum similarity.

Tools used: Neural Networks, Optimization techniques, etc.
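One common optimization technique for this objective is k-means, which alternates between assigning each item to its nearest cluster center and moving each center to the mean of its cluster. The sketch below is a simplified one-dimensional version with fixed starting centers; the ages and the starting centers are assumptions made for the example.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data with given starting centers."""
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster.
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters

# Hypothetical customer ages forming two visible groups.
ages = [21, 23, 25, 58, 60, 62]
centers, clusters = k_means_1d(ages, centers=[20.0, 30.0])
print(centers, clusters)
```

Members within each resulting cluster are close to one another, and the two cluster centers are far apart, which is the stated objective.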

Association

A pattern is discovered based on the relationship between a particular item and other items in the same transaction. Often called Market Basket Analysis. For example, the association technique is used in market basket analysis to identify which products customers frequently purchase together. Based on this data, businesses can run corresponding marketing campaigns to sell more products and increase profit.
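Two standard measures in market basket analysis are support (how often items appear together across all transactions) and confidence (how often the second item appears given the first). A minimal sketch, with hypothetical baskets and function names:

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions (each is the set of items in one basket).
baskets = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "jam"},
    {"milk", "butter"},
]

def pair_support(transactions):
    """Fraction of transactions that contain each pair of items."""
    counts = Counter()
    for basket in transactions:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(transactions) for pair, n in counts.items()}

def confidence(transactions, antecedent, consequent):
    """Of the baskets containing `antecedent`, the share that also contain `consequent`."""
    with_antecedent = [b for b in transactions if antecedent in b]
    both = [b for b in with_antecedent if consequent in b]
    return len(both) / len(with_antecedent)

support = pair_support(baskets)
print(support[("bread", "butter")])
print(confidence(baskets, "bread", "butter"))
```

Here bread and butter appear together in half of the baskets, and two out of three bread buyers also buy butter, which is the kind of rule a marketing campaign could act on.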

Prediction

Discovers relationships among independent variables and between dependent and independent variables. For example, prediction analysis can be used in sales to predict future profit: if sales is treated as the independent variable, profit is the dependent variable. Based on historical sales and profit data, a profit prediction can then be derived.
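The sales and profit example can be sketched with ordinary least squares, which fits a straight line through the historical points and then evaluates it at a new sales value. The figures below are made up for illustration, and a linear relationship is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: sales (independent) and profit (dependent).
sales = [100, 200, 300, 400]
profit = [15, 35, 55, 75]

slope, intercept = fit_line(sales, profit)
predicted = slope * 500 + intercept  # predicted profit for sales of 500
print(slope, intercept, predicted)
```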

Sequence Discovery

Identification of associations over time. Sequence discovery techniques keep track of the elapsed time between associated events and of how frequently each sequence occurs. This information can be used to increase sales or to detect fraud.
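A minimal sketch of the idea: for each customer, count every "A then B" pair of events in which B follows A within a maximum elapsed time. The event log, the event names, and the seven-day window are assumptions made for the example.

```python
from collections import Counter

# Hypothetical event log: (customer, day, event).
events = [
    ("c1", 1, "buy_phone"), ("c1", 4, "buy_case"),
    ("c2", 2, "buy_phone"), ("c2", 3, "buy_case"),
    ("c3", 5, "buy_phone"), ("c3", 40, "buy_case"),
]

def frequent_sequences(log, max_gap_days):
    """Count 'A then B' pairs per customer where B follows A within max_gap_days."""
    by_customer = {}
    for customer, day, event in log:
        by_customer.setdefault(customer, []).append((day, event))
    counts = Counter()
    for history in by_customer.values():
        history.sort()  # order each customer's events by day
        for i, (day_a, event_a) in enumerate(history):
            for day_b, event_b in history[i + 1:]:
                if day_b - day_a <= max_gap_days:
                    counts[(event_a, event_b)] += 1
    return counts

sequences = frequent_sequences(events, max_gap_days=7)
print(sequences)
```

Customer c3 bought both items too, but 35 days apart, so that pair is not counted as part of the sequence.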

Regression

A form of estimation. A statistical technique used to map data to a predicted value. Two types:

- Linear Regression
- Non-Linear Regression

Forecasting

Another form of estimation. It estimates future values based on patterns within large sets of data (e.g., demand forecasting).
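One of the simplest forecasting methods is a moving average, which estimates the next value as the mean of the most recent observations. It is shown here only as an illustration; the demand figures and the window size are assumptions.

```python
def moving_average_forecast(series, window):
    """Forecast the next value as the mean of the last `window` observations."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    recent = series[-window:]
    return sum(recent) / window

# Hypothetical monthly demand figures.
demand = [120, 130, 125, 140, 150, 145]
forecast = moving_average_forecast(demand, window=3)
print(forecast)
```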

Data Mining Tools


Tools are classified by the structure of the data and by the data mining algorithms used. Some important tools in data mining:

- Decision Trees
- Statistical Methods
- Case-based reasoning
- Neural computing
- Intelligent agents
- Genetic algorithms

Decision Trees

Used in classification and clustering methods. Break down problems into discrete subsets. A decision tree (DT) consists of a root followed by internal nodes. Each node is labeled with a question. The arcs leaving each node specify all possible responses, and each response represents a probable outcome.
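The structure described above maps directly onto nested data: each internal node holds a question, each arc is one possible response, and each leaf is an outcome. The tree below is a hand-written, hypothetical example, not one learned from data.

```python
# Each internal node holds a question; each arc is one possible response.
# Leaves are plain strings. The tree itself is a hypothetical example.
tree = {
    "question": "income_band",
    "arcs": {
        "low": "reject",
        "high": {
            "question": "has_defaulted_before",
            "arcs": {"yes": "review", "no": "approve"},
        },
    },
}

def decide(node, answers):
    """Walk from the root, following the arc that matches each answer."""
    while isinstance(node, dict):
        response = answers[node["question"]]
        node = node["arcs"][response]
    return node

outcome = decide(tree, {"income_band": "high", "has_defaulted_before": "no"})
print(outcome)
```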

Statistical Methods

Include linear and non-linear regression, point estimation, probability distribution, correlations and cluster analysis.

Case-based reasoning

Is a cognitive approach. This approach uses historical cases to recognize patterns.

Neural Networks/Computing

Works in a manner similar to how the neurons of the human brain function. This approach examines a massive amount of historical data for patterns. Application areas include the financial services and manufacturing sectors.

Intelligent Agents

Is the most promising approach for retrieving information from external databases. Web-based data mining applications use intelligent agents to extract the right information over the Internet.

Genetic Algorithms

Basically work on the principle of expanding the set of possible outcomes. Given a fixed number of possible outcomes, the algorithm is used to produce new and better solutions to a problem. Used for clustering and association methods.
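A minimal sketch of a genetic algorithm: start from a fixed population of candidate solutions, then repeatedly select the fittest, recombine them (crossover), and occasionally alter them (mutation) to produce new candidates. The toy objective (maximize the number of 1s in a bit string) and all parameter values are assumptions chosen to keep the example small; a real clustering or association task would need its own fitness function.

```python
import random

def fitness(bits):
    """Toy objective (an assumption): the number of 1s in the bit string."""
    return sum(bits)

def evolve(length=12, population_size=20, generations=30, seed=0):
    rng = random.Random(seed)  # seeded so that runs are repeatable
    population = [
        [rng.randint(0, 1) for _ in range(length)] for _ in range(population_size)
    ]
    best_per_generation = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        best_per_generation.append(fitness(population[0]))
        parents = population[: population_size // 2]  # selection: fitter half
        children = [population[0][:]]                 # elitism: keep the best
        while len(children) < population_size:
            mother, father = rng.sample(parents, 2)
            cut = rng.randint(1, length - 1)
            child = mother[:cut] + father[cut:]       # crossover
            if rng.random() < 0.1:                    # mutation: flip one bit
                position = rng.randrange(length)
                child[position] = 1 - child[position]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, best_per_generation

best, history = evolve()
print(best, history)
```

Because the best candidate is always carried over unchanged, the best fitness never gets worse from one generation to the next.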

Applications of Data Mining


- Sales/Marketing
- Banking / Finance
- Health Care and Insurance
- Transportation
- Medicine

Data Mining Applications in Sales/Marketing

- Data mining enables businesses to understand the patterns hidden inside past purchase transactions, which helps them plan and launch new marketing campaigns in a prompt and cost-effective way.
- Data mining is used for market basket analysis, which provides insight into what product combinations customers purchased, when they were bought, and in what sequence.
- This information helps businesses promote their most profitable products to maximize profit. It also encourages customers to purchase related products that they may have missed or overlooked.
- Retail companies use data mining to identify customers' buying behavior patterns.

Data Mining Applications in Banking / Finance


- Distributed data mining techniques have been researched, modeled, and developed to help detect credit card fraud.
- Data mining is used to identify customer loyalty by analyzing customers' purchasing activity: the frequency of purchases in a period of time, the total monetary value of all purchases, and the date of the last purchase. After these dimensions are analyzed, a relative measure is generated for each customer. The higher the score, the more loyal the customer is, relative to the others.
- Data mining helps banks retain credit card customers. By analyzing past data, it can help banks predict which customers are likely to change their credit card affiliation, so that the banks can plan and launch special offers to retain those customers.
- Credit card spending by customer group can be identified using data mining.
- Hidden correlations between different financial indicators can be discovered using data mining.
- From historical market data, data mining also makes it possible to identify stock trading rules.
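The loyalty measure described above (last purchase, purchase frequency, total monetary value) can be sketched as a simple rank-based score. Ranking each dimension and adding the three ranks is one possible scoring scheme, chosen here as an assumption; the customer data is hypothetical.

```python
# Hypothetical purchase summary per customer:
# (days since last purchase, purchases in the period, total spend)
customers = {
    "ann": (5, 12, 900.0),
    "bob": (40, 3, 150.0),
    "cara": (15, 7, 400.0),
}

def rank_scores(values, higher_is_better=True):
    """Map each customer to a rank score from 1 (worst) to n (best)."""
    ordered = sorted(values, key=values.get, reverse=not higher_is_better)
    return {name: position + 1 for position, name in enumerate(ordered)}

def loyalty_scores(data):
    """Sum the rank scores of the three dimensions for each customer."""
    # Fewer days since the last purchase is better, so recency is inverted.
    recency = rank_scores({n: v[0] for n, v in data.items()}, higher_is_better=False)
    frequency = rank_scores({n: v[1] for n, v in data.items()})
    monetary = rank_scores({n: v[2] for n, v in data.items()})
    return {n: recency[n] + frequency[n] + monetary[n] for n in data}

scores = loyalty_scores(customers)
print(scores)
```

The score is only meaningful relative to the other customers in the same data set, which matches the "relative measure" wording above.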

Data Mining Applications in Health Care and Insurance

The growth of the insurance industry depends entirely on the ability to convert data into knowledge, information, or intelligence about customers, competitors, and markets. Data mining has been applied in the insurance industry only recently, but it has brought tremendous competitive advantages to the companies that have implemented it successfully. The data mining applications in the insurance industry are listed below:

- Data mining is applied in claims analysis, for example to identify which medical procedures are claimed together.
- Data mining makes it possible to forecast which customers will potentially purchase new policies.
- Data mining allows insurance companies to detect risky customers' behavior patterns.
- Data mining helps detect fraudulent behavior.

Data Mining Applications in Transportation

Data mining helps to determine the distribution schedules among warehouses and outlets and analyze loading patterns.

Data Mining Applications in Medicine


- Data mining makes it possible to characterize patient activity in order to anticipate upcoming office visits.
- Data mining helps identify the patterns of successful medical therapies for different illnesses.
