
Authenticating and Reducing False hits in Mining

By Ujwala Bhoga

Data mining is defined as the extraction of interesting patterns or knowledge from huge amounts of data.
Data: any facts, numbers, or text that can be processed by a computer.
Information: the patterns, associations, or relationships among all this data.
Knowledge: information converted into an understanding of historical patterns and future trends.

Data mining comes in two flavors:
Directed: Directed data mining attempts to explain or categorize some particular target field, such as income or response.
Undirected: Undirected data mining attempts to find patterns or similarities among groups of records without the use of a particular target field or collection of predefined classes.

Data mining is largely concerned with building models. A model is simply an algorithm or set of rules that connects a collection of inputs to a particular target or outcome. Many problems of intellectual, economic, and business interest can be phrased in terms of the following tasks:
Classification
Estimation
Prediction
Affinity grouping
Clustering
Description and Profiling

The first three are examples of directed data mining, where the goal is to find the value of a particular target variable. Affinity grouping and clustering are undirected tasks, where the goal is to uncover structure in data without respect to a particular target variable. Profiling is a descriptive task that may be either directed or undirected.

The most commonly used techniques in data mining are:

Artificial neural networks: Non-linear predictive models that learn through training and resemble biological neural networks in structure.
Decision trees: Tree-shaped structures that represent sets of decisions. These decisions generate rules for the classification of a dataset. Specific decision tree methods include Classification and Regression Trees (CART) and Chi Square Automatic Interaction Detection (CHAID).
Genetic algorithms: Optimization techniques that use processes such as genetic combination, mutation, and natural selection in a design based on the concepts of evolution.
Nearest neighbor method: A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.
Rule induction: The extraction of useful if-then rules from data based on statistical significance.

Decision Trees

Nearest Neighbor Classification

Neural Networks

Rule Induction

K-means Clustering


Data Mining Algorithms

A data mining algorithm is a set of heuristics and calculations that creates a data mining model from data. To create a model, the algorithm first analyzes the data you provide, looking for specific types of patterns or trends.
The mining model that an algorithm creates from your data can take various forms, including:
A set of clusters that describe how the cases in a dataset are related.
A decision tree that predicts an outcome, and describes how different criteria affect that outcome.
A mathematical model that forecasts sales.
A set of rules that describe how products are grouped together in a transaction, and the probabilities that products are purchased together.

Analysis Services includes the following algorithm types:
Classification algorithms predict one or more discrete variables, based on the other attributes in the dataset.
Regression algorithms predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset.
Segmentation algorithms divide data into groups, or clusters, of items that have similar properties.
Association algorithms find correlations between different attributes in a dataset. The most common application of this kind of algorithm is for creating association rules, which can be used in a market basket analysis.
Sequence analysis algorithms summarize frequent sequences or episodes in data, such as a Web path flow.
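
To make the association case concrete, the following minimal Python sketch computes the two quantities usually attached to an association rule in a market basket analysis: support (how often the items occur together) and confidence (the estimated probability that the consequent is bought given the antecedent). The transactions and the function names `support` and `confidence` are illustrative assumptions, not part of Analysis Services.

```python
# Hypothetical transactions: each set is one customer's basket.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimated probability of the consequent given the antecedent."""
    both = support(set(antecedent) | set(consequent), transactions)
    return both / support(antecedent, transactions)


# Rule "bread -> milk": how often do bread buyers also buy milk?
rule_support = support({"bread", "milk"}, transactions)
rule_confidence = confidence({"bread"}, {"milk"}, transactions)
```

Here bread and milk appear together in two of the four baskets, and two of the three bread baskets also contain milk.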

Experienced analysts will sometimes use one algorithm to determine the most effective inputs (that is, variables), and then apply a different algorithm to predict a specific outcome based on that data.

Nearest neighbor Method

A technique that classifies each record in a dataset based on a combination of the classes of the k record(s) most similar to it in a historical dataset (where k ≥ 1). Sometimes called the k-nearest neighbor technique.

K-Nearest-Neighbor (kNN) Models

Use the entire training database as the model.
Find the nearest data point and do the same thing as you did for that record.

Very easy to implement.
More difficult to use in production.
Disadvantage: huge models.
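
The idea above can be sketched in a few lines of Python. This is a minimal illustration, assuming numeric feature tuples, Euclidean distance, and a majority vote among the k nearest records; the function name `knn_classify` and the sample data are hypothetical. Note that the "model" is simply the whole training list, which is why kNN models grow large.

```python
import math
from collections import Counter


def knn_classify(training, query, k=3):
    """Classify query by majority vote among its k nearest training records.

    training: list of (features, label) pairs; features are numeric tuples.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Sort the whole training set by distance to the query (simple, not fast).
    nearest = sorted(training, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical dataset with two classes.
history = [
    ((1.0, 1.0), "low"),
    ((1.2, 0.8), "low"),
    ((0.9, 1.1), "low"),
    ((5.0, 5.0), "high"),
    ((5.2, 4.9), "high"),
]

label_near_low = knn_classify(history, (1.1, 1.0), k=3)
label_near_high = knn_classify(history, (5.1, 5.0), k=1)
```

Sorting the full dataset for every query keeps the sketch short; a production system would typically use an index instead, which is one reason kNN is harder to use in production.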

Authentication

Authentication is the act of confirming the truth of an attribute of a datum or entity. This might involve confirming the identity of a person or software program, tracing the origins of an artifact, or ensuring that a product is what its packaging and labeling claim it to be. In private and public computer networks, authentication is commonly done through the use of logon passwords. Knowledge of the password is assumed to guarantee that the user is authentic. Each user registers initially, using an assigned or self-declared password. On each subsequent use, the user must know and use the previously declared password. The weakness of this system for significant transactions is that passwords can often be stolen, accidentally revealed, or forgotten.

For this reason, Internet business and many other transactions require a more stringent authentication process. The use of digital certificates issued and verified by a Certificate Authority (CA) as part of a public key infrastructure is considered likely to become the standard way to perform authentication on the Internet.

Public Key Cryptography

Public-key cryptography refers to a cryptographic system requiring two separate keys, one to lock or encrypt the plaintext, and one to unlock or decrypt the ciphertext. The two main branches of public key cryptography are:
Public key encryption: a message encrypted with a recipient's public key cannot be decrypted by anyone except a possessor of the matching private key. It is presumed that this will be the owner of that key and the person associated with the public key used. This is used for confidentiality.
Digital signatures: a message signed with a sender's private key can be verified by anyone who has access to the sender's public key, thereby proving that the sender had access to the private key (and therefore is likely to be the person associated with the public key used), and that the message has not been tampered with.
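
The sign-then-verify flow of a digital signature can be illustrated with a toy textbook RSA sketch in Python. Everything here is an assumption made for illustration: the two primes are small known Mersenne primes, there is no padding, and the names `sign` and `verify` are hypothetical. It is not secure and is not the scheme used by any particular system; real applications should rely on a vetted cryptographic library.

```python
import hashlib

# Toy key material for illustration only (far too small to be secure).
P = 2**61 - 1                        # known Mersenne prime
Q = 2**89 - 1                        # known Mersenne prime
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (needs Python 3.8+)


def digest(message: bytes) -> int:
    """Hash the message and reduce it into the range of the modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes) -> int:
    """Sign with the private key: only the key owner can compute this."""
    return pow(digest(message), D, N)


def verify(message: bytes, signature: int) -> bool:
    """Verify with the public key: anyone holding (N, E) can check this."""
    return pow(signature, E, N) == digest(message)


signature = sign(b"result set")
ok = verify(b"result set", signature)
tampered_ok = verify(b"result set (modified)", signature)
```

The signature verifies for the original message and fails once the message is changed, which is the tamper-evidence property described above.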

False hit
Generally, the term hit means a successful search, i.e., the required information has been found for the given query. If the required information is not available in the database, the search is known as a false hit. False hits in data mining increase the cost of the application, so we have to reduce them in order to improve its performance. In this application, false hits are reduced by storing the queries that produced them in a separate database: the first time the information is not available for a client's query, that query is saved as a false hit in the false hit database. Whenever a client submits a query, the system first searches the false hit database.
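
The lookup order described above can be sketched as follows. This is a minimal Python illustration, assuming the main database behaves like a key-value mapping from query to result and the false hit database is an in-memory set; the class name `FalseHitCache` and the counter `main_lookups` are hypothetical and only there to show how many expensive searches are avoided.

```python
class FalseHitCache:
    """Checks a store of known false hits before searching the main database."""

    def __init__(self, database):
        self.database = database     # hypothetical main store: query -> result
        self.false_hits = set()      # queries already known to return nothing
        self.main_lookups = 0        # number of (expensive) main searches

    def search(self, query):
        # Step 1: consult the false hit store first.
        if query in self.false_hits:
            return None
        # Step 2: fall back to the main database.
        self.main_lookups += 1
        result = self.database.get(query)
        # Step 3: remember the query if nothing was found.
        if result is None:
            self.false_hits.add(query)
        return result


cache = FalseHitCache({"patient-17": "record A"})
first = cache.search("patient-99")    # searches the main database, misses
second = cache.search("patient-99")   # answered from the false hit store
found = cache.search("patient-17")    # normal hit
```

One design consequence of this sketch is that a stored false hit becomes stale if matching data is later added to the main database, so such entries would need to be cleared on insert.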


Existing System:
Several applications, including image, medical, time series, and document databases, involve high-dimensional data. Similarity retrieval in these applications based on low-dimensional indexes, such as the R*-tree, is very expensive due to the dimensionality curse. The existing system processes queries using nearest neighbor search, but the result is not authenticated, because the system provides only a result set of the nearest data.
Disadvantages:
The result set provided by this system is not fully authenticated.
It is unable to use a public key cryptosystem.
The accuracy of the nearest results is not guaranteed.

Proposed System:
The proposed system authenticates query processing by maintaining a dataset DB at the server that is signed by a trusted authority (e.g., the data owner or a notarization service). The signature is usually based on a public key cryptosystem. The server receives and processes queries from clients. Each query returns a result set containing the records of the database that satisfy certain predicates. Moreover, the client must be able to establish that the result set is correct, i.e., that it contains all records of the database that satisfy the query condition and that these records have not been modified by the server or another entity. Since the signature captures the entire database and the server returns verification objects along with the result, the client can verify the result set based on the signature and the signer's public key. To ease this problem, we present a novel technique that reduces the size of each false hit.
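
One common way to let a single signature cover an entire database while keeping verification objects small is a Merkle hash tree. The slides do not say which structure the proposed system uses, so the Python sketch below is an assumption for illustration only. It shows just the "records have not been modified" check; completeness of the result set and the signature over the root are not shown, and the function names are hypothetical.

```python
import hashlib


def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_levels(records):
    """Return all tree levels, leaf hashes first and the root level last."""
    level = [h(r) for r in records]
    levels = [level]
    while len(level) > 1:
        if len(level) % 2 == 1:               # pad odd levels with last node
            level = level + [level[-1]]
            levels[-1] = level
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def make_proof(levels, index):
    """Verification object: sibling hashes on the path from leaf to root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_record(record, proof, trusted_root):
    """Client side: recompute the root from one record and its proof."""
    node = h(record)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == trusted_root


# Hypothetical database; the owner would sign `root` with a private key.
db = [b"rec-0", b"rec-1", b"rec-2", b"rec-3", b"rec-4"]
levels = build_levels(db)
root = levels[-1][0]
proof = make_proof(levels, 2)                 # server returns this with rec-2
```

The client only needs the record, the short proof, and the signed root, so the verification object grows with the logarithm of the database size rather than with the database itself.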

Advantages:
The system provides an accurate result set.
Using a public key cryptosystem, the system provides a result set that is fully authenticated to the user and can be viewed together with its signature.
Because the AMNN method is used, the client sees accurate data.

Operating system : Windows 7 / XP Professional
Front End : Microsoft Visual Studio .Net 2008
Coding Language : Visual C# .Net
Backend : SqlServer 2005



Use case diagram

Class diagram

Object diagram

State diagram

Activity diagram

Sequence diagram

Collaboration diagram

Component diagram