
(IJCNS) International Journal of Computer and Network Security, Vol. 2, No. 10, 2010

Evaluating Clustering Performance for the Protected Data using Perturbative Masking Techniques in Privacy Preserving Data Mining
S. Vijayarani (1), Dr. A. Tamilarasi (2)
(1) School of Computer Science and Engg., Bharathiar University, Coimbatore, Tamilnadu, India
vijimohan_2000@yahoo.com
(2) Dept. of MCA, Kongu Engg. College, Erode, Tamilnadu, India
drtamil@kongu.ac.in

Abstract- Privacy preserving data mining has become very popular for protecting confidential knowledge extracted by data mining techniques. It is the study of how to produce valid mining models and patterns without disclosing private information. Several techniques are used for protecting sensitive data, among them statistical methods, cryptographic methods, randomization, the k-anonymity model and l-diversity. In this work, we analyze two statistical disclosure control techniques, namely additive noise and micro aggregation, and examine their clustering performance. The experimental results show that the clustering performance of the additive noise technique is comparatively better than that of micro aggregation.

Keywords- Data Perturbation, Micro Aggregation, Additive Noise, K-means clustering

1. Introduction

The problem of privacy-preserving data mining has become more important in recent years because of the increasing ability to store personal data about users, and the increasing sophistication of data mining algorithms to leverage this information. Many data mining applications, such as those handling financial transactions, health-care records and network communication traffic, deal with private, sensitive data. Data are an important asset for business organizations and governments, which analyze them to support decision making. Privacy regulations and other privacy concerns may prevent data owners from sharing information for data analysis. In order to share data while preserving privacy, the data owner must come up with a solution that achieves the dual goal of privacy preservation and accurate data mining results.

The main consideration in privacy preserving data mining is twofold. First, sensitive raw data such as identifiers, names, addresses and the like should be modified or trimmed out from the original database. Second, sensitive knowledge which can be mined from a database by using data mining algorithms should also be excluded, because such knowledge can equally well compromise data privacy.

The main objective in privacy preserving data mining is to develop algorithms for modifying the original data in some way, so that the private data and private knowledge remain private even after the mining process [8]. Data modification is one of the privacy preserving techniques used to modify the sensitive or original information available in a database that needs to be released to the public. It ensures high privacy protection.

The rest of this paper is organized as follows. In Section 2, we present an overview of micro data and masking techniques. Section 3 discusses different types of micro data protection techniques. Additive noise and micro aggregation techniques are discussed in Section 4. Section 5 gives the performance results of additive noise and micro aggregation. Conclusions are given in Section 6.

2. Micro Data

Static data about individual respondents are called micro data. Micro data can be represented as tables consisting of tuples (records) with values from a set of attributes. A micro data set V is a file with n records, where each record contains m attributes on an individual respondent [3]. The attributes can be classified in four categories which are not necessarily disjoint:

• Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name and surname, etc.

• Quasi-identifiers or key attributes. These are attributes which identify the respondent with some degree of ambiguity. Examples are address, gender, age, telephone number, etc.

• Confidential outcome attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.

• Non-confidential outcome attributes. These are attributes which do not fall in any of the categories above.
3. Classification of micro data protection techniques (MPTs) [3]

Figure 1. Micro-data Protection Techniques

3.1 Masking Techniques

Protecting sensitive data is a very significant issue for government, public and private bodies. Masking techniques are used to prevent the disclosure of confidential information in a table. Masking techniques can operate on different data types, which can be categorized as follows.

• Continuous. An attribute is said to be continuous if it is numerical and arithmetic operations are defined on it. For instance, the attributes age and income are continuous attributes.

• Categorical. An attribute is said to be categorical if it can assume a limited and specified set of values and arithmetic operations do not make sense on it. For instance, the attributes marital status and sex are categorical attributes.

Masking techniques are classified into two categories:

• Perturbative
• Non-perturbative

3.1.1 Perturbative Masking

Perturbation means replacing an attribute value with a new value. The data set is distorted before publication, so the protected data set may contain some errors. In this way, combinations of data items present in the original dataset may disappear and new unique combinations of data items may appear in the perturbed dataset. A perturbation method is designed so that statistics computed on the perturbed dataset do not differ significantly from the statistics obtained on the original dataset [3]. Some of the perturbative masking methods are:

• Micro aggregation
• Rank swapping
• Additive noise
• Rounding
• Resampling
• PRAM
• MASSC

3.1.2 Non-Perturbative Masking

Non-perturbative techniques produce protected microdata by eliminating details from the original microdata. Some of the non-perturbative masking methods are:

• Sampling
• Local Suppression
• Global Recoding
• Top-Coding
• Bottom-Coding
• Generalization

3.2 Synthetic Techniques

The original set of tuples in a microdata table is replaced with a new set of tuples generated in such a way as to preserve the key statistical properties of the original data. The generation process is usually based on a statistical model, and the key statistical properties that are not included in the model will not necessarily be respected by the synthetic data. Since the released micro data table contains synthetic data, the re-identification risk is reduced. The techniques are divided into two categories: fully synthetic techniques and partially synthetic techniques. The first category contains techniques that generate a completely new set of data, while the techniques in the second category merge the original data with synthetic data.

3.2.1 Fully Synthetic Techniques

• Bootstrap
• Cholesky Decomposition
• Multiple Imputation
• Maximum Entropy
• Latin Hypercube Sampling

3.2.2 Partially Synthetic Techniques

• IPSO (Information Preserving Statistical Obfuscation)
• Hybrid Masking
• Random Response
• Blank and Impute
• SMIKe (Selective Multiple Imputation of Keys)
• Multiply Imputed Partially Synthetic Dataset [3]

4. Analysis of the SDC techniques

The main steps involved in this work are:

• Selecting a sensitive numerical data item from the database
• Modifying the sensitive data item using micro aggregation and additive noise
• Analyzing the statistical performance
• Analyzing the accuracy of privacy protection
• Evaluating the clustering accuracy
4.1 Micro aggregation

Micro aggregation is an SDC technique consisting in the aggregation of individual data. It can be considered as an SDC sub-discipline devoted to the protection of micro data. Micro aggregation can be seen as a clustering problem with constraints on the size of the clusters. It is related to other clustering problems (e.g., dimension reduction or minimum squares design of clusters). However, the main difference is that the micro aggregation problem does not consider the number of clusters to generate or the number of dimensions to reduce, but only the minimum number of elements that are grouped in the same cluster [9]. For any type of data, micro aggregation can be operationally defined in terms of the following two steps:

• Partition: The set of original records is partitioned into several clusters in such a way that records in the same cluster are similar to each other and the number of records in each cluster is at least k.

• Aggregation: An aggregation operator (for example, the mean for continuous data or the median for categorical data) is computed for each cluster and is used to replace the original records. In other words, each record in a cluster is replaced by the cluster's prototype.

From an operational point of view, micro aggregation

• applies a clustering algorithm to a set of data, obtaining a set of clusters (formally, the algorithm determines a partition of the original data);
• calculates a cluster representative for each cluster; and finally
• replaces each original datum by the corresponding cluster representative [7].

After the values have been modified, the k-means algorithm is applied to find out whether the original value and the modified value in the micro aggregation table fall in the same cluster.

4.2. Additive Noise

Additive noise perturbs a sensitive attribute by adding a random variable with a given distribution to it, or by multiplying it with such a variable [2].

• Masking by uncorrelated noise addition

The vector of observations $x_j$ for the j-th attribute of the original dataset $X_j$ is replaced by a vector

$z_j = x_j + \varepsilon_j$

where $\varepsilon_j$ is a vector of normally distributed errors drawn from a random variable $\varepsilon_j \sim N(0, \sigma^2_{\varepsilon_j})$, such that $\mathrm{Cov}(\varepsilon_t, \varepsilon_l) = 0$ for all $t \neq l$. This preserves means, but it preserves neither variances nor correlations.

• Masking by correlated noise addition

Correlated noise addition also preserves means and additionally allows preservation of correlation coefficients. The difference from the previous method is that the covariance matrix of the errors is now proportional to the covariance matrix of the original data, i.e. $\varepsilon \sim N(0, \Sigma_\varepsilon)$, where $\Sigma_\varepsilon = \alpha\Sigma$.

In this work we have used the following additive noise algorithm.

• Consider a database D consisting of n tuples, D = {t1, t2, …, tn}. Each tuple consists of a set of attributes T = {A1, A2, …, Ap}, where Ai ∈ T and ti ∈ D
• Identify the sensitive or confidential numeric attribute AR
• Calculate the mean, $\mathrm{mean} = \frac{1}{n}\sum_{i=1}^{n} AR_i$
• Initialize countgre = 0 and countmin = 0
• For each i = 1, …, n:
  if ARi >= mean then
  {
    store ARi in group1
    countgre = countgre + 1
  }
  else
  {
    store ARi in group2
    countmin = countmin + 1
  }
• Calculate the noise1 value as 2*mean/countgre
• Calculate the noise2 value as 2*mean/countmin
• Subtract the noise1 value from each data item in group1
• Add the noise2 value to each data item in group2
• Release the new modified sensitive data
• The mean of the modified data is the same as the mean of the original data
• The total noise subtracted from group1 (countgre × noise1) equals the total noise added to group2 (countmin × noise2), so the net noise added is 0

5. Experimental Results

To conduct the experiments, a synthetic employee dataset with 500 records was created. From this dataset, we select the sensitive numeric attribute income. Additive noise and micro aggregation techniques are used for modifying the attribute income. The following performance factors are considered for evaluating the two techniques.

5.1 Statistical Calculations

The statistical properties mean, standard deviation and variance of the modified data are compared with those of the original data. Both techniques produced the same results.
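The two masking procedures of Section 4 and the checks of Section 5 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sample income values, the univariate fixed-size grouping used for micro aggregation, the deterministic choice of initial k-means centers and the use of population statistics are all assumptions made here for the sake of a runnable example.

```python
from statistics import mean, pstdev, pvariance


def additive_noise_mask(values):
    """Mean-preserving shift following the steps listed in Section 4.2."""
    m = mean(values)
    count_gre = sum(1 for v in values if v >= m)  # size of group1
    count_min = len(values) - count_gre           # size of group2
    if count_min == 0:
        # Assumption: the case where all values are equal is not covered.
        raise ValueError("noise2 is undefined when no value is below the mean")
    noise1 = 2 * m / count_gre
    noise2 = 2 * m / count_min
    return [v - noise1 if v >= m else v + noise2 for v in values]


def microaggregate(values, k):
    """Assumed variant: sort, form groups of at least k records, use group means."""
    n = len(values)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of records")
    order = sorted(range(n), key=lambda i: values[i])
    masked = [0.0] * n
    start = 0
    while start < n:
        end = start + k
        if n - end < k:  # too few records left for another full group,
            end = n      # so the current group absorbs them
        group = order[start:end]
        centroid = mean(values[i] for i in group)
        for i in group:
            masked[i] = centroid
        start = end
    return masked


def fraction_modified(original, masked):
    """Share of records whose value changed (the check of Section 5.2)."""
    return sum(o != m for o, m in zip(original, masked)) / len(original)


def kmeans_1d(values, k, max_iter=100):
    """Plain k-means on one attribute; label 0 is the lowest-valued cluster."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption: deterministic start from k evenly spaced order statistics.
    centers = [ordered[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: abs(v - centers[j]))
                      for v in values]
        if new_labels == labels:  # "until no change"
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:
                centers[j] = mean(members)
    return labels


def agreement(labels_a, labels_b):
    """Share of records placed in the same cluster by both clusterings."""
    return sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)


if __name__ == "__main__":
    incomes = [21000, 25000, 30000, 34000, 41000, 48000, 52000, 60000]  # hypothetical
    original_labels = kmeans_1d(incomes, 2)
    for name, masked in (("additive noise", additive_noise_mask(incomes)),
                         ("micro aggregation", microaggregate(incomes, 3))):
        print(name, mean(masked), pstdev(masked), pvariance(masked),
              fraction_modified(incomes, masked),
              agreement(original_labels, kmeans_1d(masked, 2)))
```

Both functions leave the mean unchanged by construction. In `additive_noise_mask` the shift is roughly 4*mean/n when the two groups are of similar size, so it is large for a toy list like the one above and much smaller for a dataset of 500 records; results on a handful of values therefore say little about the behaviour reported in the experiments.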


Figure 2. Statistical Performance

5.2. Privacy Protection

In order to verify the privacy protection, we analyzed whether all the sensitive data items were modified. The results show that every sensitive data item was modified, i.e. there is 100% privacy protection.

Figure 3. Data Modification

5.3 Accuracy of Clustering

• K-Means Clustering Algorithm

The k-means algorithm is a partitioning algorithm in which each cluster's center is represented by the mean value of the objects in the cluster.

Input:
• k: the number of clusters
• D: a data set containing n objects

Output: a set of k clusters

Method:
• Arbitrarily choose k objects from D as the initial cluster centers
• Repeat
  • (Re)assign each object to the cluster to which the object is the most similar, based on the mean value of the objects in the cluster
  • Update the cluster means, i.e. calculate the mean value of the objects for each cluster
• Until no change

In order to verify the clustering accuracy we have used the k-means clustering algorithm. The original clusters are compared with the modified clusters. The clustering of the data set modified by additive noise is the same as the original clustering.

Figure 4. Accuracy of additive noise and micro aggregation.

6. Conclusions

Preserving privacy in data mining activities is a very important issue in many applications. In this paper, we have analyzed the micro aggregation and additive noise perturbative masking techniques in privacy preserving data mining. Both micro aggregation and additive noise perform well in the statistical calculations and in privacy protection. We then used the modified datasets for clustering, and the results show that the additive noise technique is comparatively better than micro aggregation.

ACKNOWLEDGEMENT

I would like to thank the UGC, New Delhi, for providing me the necessary funds.

References

[1] Brand R (2002). "Micro data protection through noise addition". In Domingo-Ferrer J, editor, Inference Control in Statistical Databases, vol. 2316 of LNCS, pp. 97-116. Springer, Berlin Heidelberg.
[2] Charu C. Aggarwal (IBM T.J. Watson Research Center, USA) and Philip S. Yu (University of Illinois at Chicago, USA). "Privacy preserving data mining: Models and algorithms".
[3] Ciriani, S. De Capitani di Vimercati, S. Foresti, and P. Samarati. "Micro data protection". Advances in Information Security, © Springer US (2007).
[4] Feng Li, Jin Ma, Jian-hua Li (School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200030, China). "Distributed anonymous data perturbation method for privacy-preserving data mining".
[5] G. R. Sullivan. The Use of Added Error to Avoid Disclosure in Microdata Releases. PhD thesis, Iowa State University, 1989.
[6] J. J. Kim. A method for limiting disclosure in microdata based on random noise and transformation. In Proceedings of the Section on Survey Research Methods, pages 303–308, Alexandria VA, 1986. American Statistical Association.
[7] Vicenç Torra. "Constrained micro aggregation: Adding constraints for Data Editing". IIIA - Artificial Intelligence Research Institute, CSIC - Spanish Council for Scientific Research, Campus UAB s/n, 08193 Bellaterra (Catalonia, Spain).
[8] Vassilios S. Verykios, Elisa Bertino, Igor Nai Fovino, Loredana Parasiliti Provenza, Yucel Saygin, Yannis Theodoridis. "State-of-the-art in Privacy Preserving Data Mining". SIGMOD Record, Vol. 33, No. 1, March 2004.
[9] Xiaoxun Sun, Hua Wang, Jiuyong Li. "Microdata Protection Through Approximate Microaggregation". Thirty-Second Australasian Computer Science Conference (ACSC2009), Wellington, New Zealand, 2009. Conferences in Research and Practice in Information Technology (CRPIT), Vol. 91. Bernard Mans, Ed. Australian Computer Society, Inc.

Authors Profile

Mrs. S. Vijayarani has completed MCA and M.Phil in Computer Science. She is working as Assistant Professor in the School of Computer Science and Engineering, Bharathiar University, Coimbatore. She is currently pursuing her Ph.D in the area of privacy preserving data mining. She has published two papers in international journals and presented six research papers in international and national conferences.

Dr. A. Tamilarasi is a Professor and Head of the Department of MCA, Kongu Engineering College, Perundurai. She has supervised a number of Ph.D students. She has published a number of research papers in national and international journals and conference proceedings.
