Abstract- Privacy preserving data mining has become very popular for protecting the confidential knowledge extracted by data mining techniques. Privacy preserving data mining is the study of how to produce valid mining models and patterns without disclosing private information. Several techniques are used for protecting sensitive data, such as statistical methods, cryptographic methods, randomization, the k-anonymity model, and l-diversity. In this work, we analyze two statistical disclosure control techniques, additive noise and micro aggregation, and examine their clustering performance. The experimental results show that the clustering performance of the additive noise technique is comparatively better than that of micro aggregation.

Keywords- Data Perturbation, Micro Aggregation, Additive Noise, K-means Clustering

1. Introduction

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users and the increasing sophistication of data mining algorithms that leverage this information. Many data mining applications, such as financial transactions, health-care records, and network communication traffic, deal with private, sensitive data. Data is an important asset to business organizations and governments, which analyze it for decision making. Privacy regulations and other privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goals of privacy preservation and accurate data mining results.

The main consideration in privacy preserving data mining is twofold. First, sensitive raw data like identifiers, names, addresses and the like should be modified or trimmed out of the original database. Second, sensitive knowledge which can be mined from a database by data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy.

The main objective in privacy preserving data mining is to develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process [8]. Data modification is one of the privacy preserving techniques used to modify the sensitive or original information in a database that needs to be released to the public. It ensures high privacy protection.

The rest of this paper is organized as follows. In Section 2, we present an overview of micro data and masking techniques. Section 3 discusses different types of micro data protection techniques. Additive noise and micro aggregation techniques are discussed in Section 4. Section 5 gives the performance results of additive noise and micro aggregation. Conclusions are given in Section 6.

2. Micro Data

Micro data are static individual data, which can be represented as tables consisting of tuples (records) with values from a set of attributes. A micro data set V is a file with n records, where each record contains m attributes on an individual respondent [3]. The attributes can be classified in four categories, which are not necessarily disjoint:

Ø Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name and surname, etc.

Ø Quasi-identifiers or key attributes. These are attributes which identify the respondent with some degree of ambiguity. Examples are address, gender, age, telephone number, etc.

Ø Confidential outcome attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.

Ø Non-confidential outcome attributes. These are attributes which do not fall in any of the categories above.
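The four attribute categories above can be illustrated with a small sketch; all attribute names, values, and the release rule here are hypothetical, not taken from any dataset in this paper:

```python
# Hypothetical micro data record illustrating the four attribute
# categories (identifiers, quasi-identifiers, confidential and
# non-confidential outcome attributes).
record = {
    "ssn": "123-45-6789",        # identifier: unambiguously identifies the respondent
    "name": "A. Respondent",     # identifier
    "age": 34,                   # quasi-identifier: identifies with some ambiguity
    "gender": "F",               # quasi-identifier
    "salary": 52000,             # confidential outcome attribute
    "health_condition": "none",  # confidential outcome attribute
    "favorite_color": "blue",    # non-confidential outcome attribute
}

categories = {
    "identifiers": ["ssn", "name"],
    "quasi_identifiers": ["age", "gender"],
    "confidential": ["salary", "health_condition"],
    "non_confidential": ["favorite_color"],
}

# Before release, identifiers are dropped outright; quasi-identifiers and
# confidential attributes are the candidates for masking.
released = {k: v for k, v in record.items() if k not in categories["identifiers"]}
print(sorted(released))
```

The point of the sketch is only that removing identifiers is not enough on its own: the quasi-identifiers left in `released` are what the masking techniques of the next section must protect.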
(IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 10, 2010
3.1 Masking Techniques

Protecting sensitive data is a very significant issue for government, public and private bodies. Masking techniques are used to protect the confidential information in a table. Masking techniques can operate on different data types, which can be categorized as follows.

Ø Continuous. An attribute is said to be continuous if it is numerical and arithmetic operations are defined on it. For instance, the attributes age and income are continuous attributes.

Ø Categorical. An attribute is said to be categorical if it can assume a limited and specified set of values and arithmetic operations do not make sense on it. For instance, the attributes marital status and sex are categorical attributes.

Masking techniques are classified into two categories:

Ø Perturbative
Ø Non-Perturbative

3.1.1 Perturbative Masking

Perturbation means replacing an attribute value with a new value; the data set is distorted before publication. The data is distorted in a way that affects the protected data set, i.e. it may contain some errors. In this way the original data may disappear and new unique combinations of data items may appear in the perturbed dataset; in the perturbation method, statistics computed on the perturbed dataset should not differ significantly from the statistics obtained on the original dataset [3]. Some of the perturbative masking methods are:

Ø Micro aggregation
Ø Rank swapping
Ø Additive noise
Ø Rounding
Ø Resampling
Ø PRAM
Ø MASSC, etc.

3.1.2 Non-Perturbative Masking

Some of the non-perturbative masking methods are:

Ø Sampling
Ø Local Suppression
Ø Global Recoding
Ø Top-Coding
Ø Bottom-Coding
Ø Generalization

3.2 Synthetic Data Generation Techniques

The original set of tuples in a microdata table is replaced with a new set of tuples generated in such a way as to preserve the key statistical properties of the original data. The generation process is usually based on a statistical model, and key statistical properties that are not included in the model will not necessarily be respected by the synthetic data. Since the released micro data table contains synthetic data, the re-identification risk is reduced. These techniques are divided into two categories: fully synthetic techniques and partially synthetic techniques. The first category contains techniques that generate a completely new set of data, while the techniques in the second category merge the original data with synthetic data.

3.2.1 Fully Synthetic Techniques

Ø Bootstrap
Ø Cholesky Decomposition
Ø Multiple Imputation
Ø Maximum Entropy
Ø Latin Hypercube Sampling

3.2.2 Partially Synthetic Techniques

Ø IPSO (Information Preserving Statistical Obfuscation)
Ø Hybrid Masking
Ø Random Response
Ø Blank and Impute
Ø SMIKe (Selective Multiple Imputation of Keys)
Ø Multiply Imputed Partially Synthetic Dataset [3]

4. Analysis of the SDC Techniques

The main steps involved in this work are:

• Selecting the sensitive numerical data item from the database
• Modifying the sensitive data item using micro aggregation and additive noise
• Analyzing the statistical performance
• Analyzing the accuracy of privacy protection
• Evaluating the clustering accuracy
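The micro aggregation step in the list above can be sketched roughly as follows. This is a minimal univariate version; the fixed minimum group size k and the use of the group mean as the replacement value are standard choices assumed for the sketch, not details taken from this paper:

```python
def microaggregate(values, k=3):
    """Univariate micro aggregation sketch: sort the values, partition
    them into groups of at least k consecutive values, and replace each
    value with the mean of its group."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i, n = 0, len(values)
    while i < n:
        # If fewer than 2k values remain, merge them into one last group
        # so that every group has at least k members.
        j = n if n - i < 2 * k else i + k
        group = order[i:j]
        mean = sum(values[g] for g in group) / len(group)
        for g in group:
            out[g] = mean
        i = j
    return out

# Hypothetical income values, in original record order.
incomes = [1200, 900, 1500, 1100, 2000, 950, 1800, 1300]
print(microaggregate(incomes, k=3))
```

Because every released value is shared by at least k records, an attacker cannot single out an individual from this attribute alone, while the overall total (and hence the mean) of the attribute is preserved exactly.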
After modifying the values, the k-means algorithm is applied to find whether the original value and the modified value in the micro aggregation table fall in the same cluster.

4.2 Additive Noise

Additive noise perturbs a sensitive attribute by adding to it, or multiplying it by, a random variable with a given distribution [2].

Ø Masking by uncorrelated noise addition

The vector of observations xj for the j-th attribute of the original dataset Xj is replaced by a vector

zj = xj + εj

where εj is a vector of normally distributed errors drawn from a random variable εj ∼ N(0, σ²εj), such that Cov(εt, εl) = 0 for all t ≠ l. This preserves neither variances nor correlations.

5. Experimental Results

In order to conduct the experiments, a synthetic employee dataset was created with 500 records. From this dataset, we select the sensitive numeric attribute, income. Additive noise and micro aggregation techniques are used for modifying the attribute income. The following performance factors are considered for evaluating the two techniques.

5.1 Statistical Calculations

The statistical properties mean, standard deviation and variance of the modified data can be compared with those of the original data. Both techniques produced the same results.
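The uncorrelated noise addition of Section 4.2, together with the statistical comparison of Section 5.1, can be sketched as follows. The noise level σ and the income range are illustrative assumptions; they are not the parameters used in this paper's experiments:

```python
import random
import statistics

def add_uncorrelated_noise(x, sigma, seed=0):
    """Masking by uncorrelated noise addition: z_j = x_j + eps_j,
    with eps_j drawn independently from N(0, sigma^2) per record."""
    rng = random.Random(seed)
    return [v + rng.gauss(0.0, sigma) for v in x]

# Hypothetical income attribute for 500 synthetic employee records.
rng = random.Random(1)
income = [rng.uniform(20000, 80000) for _ in range(500)]
masked = add_uncorrelated_noise(income, sigma=1000.0)

# Compare the statistical properties of original vs. masked data
# (mean, standard deviation, variance), as in Section 5.1.
for name, f in [("mean", statistics.mean),
                ("stdev", statistics.stdev),
                ("variance", statistics.variance)]:
    print(name, round(f(income), 1), round(f(masked), 1))
```

With zero-mean noise the sample mean stays close to the original, while the variance of the masked data is inflated by roughly σ², consistent with the remark above that variances are not preserved.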