
International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 9 Sep 2013

ISSN: 2231-2803 http://www.ijcttjournal.org Page3349



Preventing Diversity Attacks in Privacy Preserving Data Mining
Sailaja.R.J.L #1, P.Dayaker *2

#1 M.Tech, Computer Science Engineering, MLRIT, Hyderabad, Andhra Pradesh, India
*2 Assistant Professor, Department of CSE, MLRIT, Hyderabad, Andhra Pradesh, India

Abstract -- Data mining exposes datasets to researchers, and sensitive information can sometimes be inferred from a
published dataset. Privacy Preserving Data Mining (PPDM) addresses this problem by ensuring that datasets are released
in a form that does not disclose the identity of the entities they describe. Before data is published, it is perturbed in
order to preserve privacy. Existing PPDM systems assume a single level of trust in data miners. Recently, Li et al. relaxed
this assumption by introducing multilevel trust in PPDM: the more a data miner is trusted, the less perturbation is
required. However, malicious data miners can recover identity information by combining multiple perturbed copies. The
scheme of Li et al. is designed to prevent such attacks. In this paper we implement that multilevel trust based PPDM
scheme, which gives data owners the freedom to choose the level of privacy needed; the data is perturbed according to
the chosen trust level. We built a prototype application that demonstrates the proof of concept. The empirical results
indicate that the approach is effective and offers flexibility to data owners.

Index Terms -- Data mining, privacy preserving data mining, random perturbation, multilevel trust

I. INTRODUCTION
Data mining is the process of extracting trends from
historical data. In other words, it provides business
intelligence that supports good decisions and, in turn,
profits. To obtain this intelligence, however, the data
owner must hand the data to a data miner, who
extracts business intelligence or discovers actionable
knowledge from it. A data miner with malicious
intentions can use the data to learn sensitive
information and exploit it for monetary gain. To avoid
this problem, the data is perturbed before it is given
to the data miner, so that sensitive information is
encoded or altered and the privacy of the data is
preserved. This approach is known as Privacy
Preserving Data Mining (PPDM). There has been
much research on PPDM [1], [2], [3], [4], [5]. All
these existing PPDM schemes assume a single level
of trust in the data miner, which means that the data
owner generates only one perturbed copy. In the real
world, however, different trust levels might be
required. This is the motivation for exploring
multilevel trust. For instance, a government
organization might have internal and external miners
and might also want to release data to the public. In
this case it needs to generate multiple copies of the
data, each perturbed according to the trust level of its
recipients. This approach has a problem of its own.
When adversaries obtain both the internal and the
external miner copies, they can recover identity
information from the data. Implementing multilevel
trust based PPDM is therefore very challenging.
Compared with the single-level trust scenario, the
data owner must produce several perturbed copies,
one for each trust level. The amount of perturbation
depends on the trust level of the data miner: the more
a miner is trusted, the less perturbation is applied to
the copy released to that miner. However, a miner
who obtains multiple, diversely perturbed copies
might reconstruct the original information more
accurately than from any single copy. This is the
problem with the approach. Preventing such diversity
attacks is a challenging task in multilevel trust based
PPDM. Random Gaussian noise is added to
overcome this problem.
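
The diversity attack described above can be illustrated with a minimal sketch. It assumes the simplest case, in which each perturbed copy carries independent Gaussian noise and the attacker simply averages the copies record by record; the column values, the noise level, and all function names are hypothetical and are not taken from the prototype.

```python
import random
import statistics

def perturb(values, sigma, rng):
    """Return one additively perturbed copy: each value plus Gaussian noise."""
    return [x + rng.gauss(0.0, sigma) for x in values]

def mean_abs_error(estimate, truth):
    """Average absolute difference between an estimate and the original values."""
    return statistics.fmean(abs(e - t) for e, t in zip(estimate, truth))

def averaging_attack(copies):
    """Average the copies record by record (a very simple diversity attack)."""
    return [statistics.fmean(column) for column in zip(*copies)]

rng = random.Random(7)
ages = [rng.uniform(18, 90) for _ in range(2000)]  # hypothetical sensitive column
sigma = 10.0                                       # hypothetical noise level
copies = [perturb(ages, sigma, rng) for _ in range(20)]  # independent noise per copy

single_error = mean_abs_error(copies[0], ages)
attack_error = mean_abs_error(averaging_attack(copies), ages)
print(f"one copy: {single_error:.2f}, 20 copies averaged: {attack_error:.2f}")
```

With independent noise, averaging M copies shrinks the noise standard deviation by a factor of about the square root of M, which is why independently perturbed copies leak more as more of them are combined.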
Recently, Li et al. [6] proposed a multilevel trust
based PPDM scheme that solves the problem of
diversity attacks. The key challenge in constructing
the scheme is to ensure that adversaries cannot
reconstruct the original data from the diversity of
multiple perturbed copies. This is achieved by
properly correlating the perturbed copies released at
different trust levels. In this paper we implement the
scheme proposed by Li et al. [6] and allow end users
to choose the trust levels and the number of perturbed
copies required, without enabling diversity attacks.
The remainder of the paper is structured as follows.
Section II reviews related literature. Section III
describes the multilevel trust based PPDM scheme.
Section IV presents experimental results, and
Section V concludes the paper.

II. PRIOR WORKS

Privacy preserving data mining has been studied for
many years. It was first introduced in [7] and [8], and
the problem has since attracted many researchers.
Existing work falls into two categories. The first is
Secure Multiparty Computation (SMC), which can
provide the highest level of privacy. It enables data
miners to perform mining operations on data without
sensitive information being disclosed to them [7],
[9]. This can be achieved with generic SMC
algorithms [10]. In practice, however, these
algorithms are very expensive and often not feasible.
To overcome this drawback, many solutions have
been proposed that are more efficient than generic
SMC. These solutions target specific data mining
tasks, such as building decision trees [7], association
rule mining [9], K-means clustering [11], and
frequent pattern mining [12]. A secure coprocessor
has also been used for privacy preserving
collaborative data mining. The second category is
partial information hiding, which offers good privacy
together with good performance. Many solutions fall
into this category, for instance k-anonymity [13],
[14], [15], [16], [17], retention replacement [18],
[19], [20], and data perturbation [1], [2], [3], [4],
[21]. Data perturbation methods fall into two classes:
additive schemes [1], [3], [5], [8] and matrix
multiplicative schemes [2], [22]. These methods are
suitable for numeric data. A new adversary model for
PPDM is proposed in [23].
Some of the existing protocols have the drawback of
leaking information [24]. Private and threshold set
operations are introduced in [25], built on algorithms
that are equity based. Reanonymization has also been
studied as a way to address the PPDM problem [26],
[27], [28]. In this paper we study anonymizing the
data multiple times for privacy preserving data
mining under multiple levels of trust. Our multilevel
trust based approach is effective because it relaxes
the assumption of a single level of trust in data
miners.

III. MULTILEVEL TRUST BASED PPDM
This section describes the PPDM scheme that
supports multilevel trust in order to protect the
privacy of data given to data miners for knowledge
discovery. Existing solutions assume a single level of
trust in data miners, which means that they release a
single perturbed copy of the data. In this paper we
support multilevel trust based PPDM, which lets data
owners create multiple perturbed copies of the data.
For each data miner, the data owner generates a
different perturbed copy based on that miner's trust
level.

Privacy Goal
The design goal of multilevel trust based PPDM is to
allow the data owner to build a distinct perturbed
copy of the data for each data miner. The data owner
distributes these copies to the data miners. For ease
of analysis, M denotes the number of copies
generated by the data owner. Based on the trust level
of the data miner, the privacy of a perturbed copy can
be computed as follows.


This lets the data owner control the data easily by
setting the privacy of each perturbed copy. However,
it has a drawback: adversaries can launch an attack
by combining multiple perturbed copies to
reconstruct the original data. To overcome this
problem, correlated noise is introduced. The noise
added to each further perturbed copy is chosen so
that combining copies gives adversaries no
advantage, and they therefore cannot mount diversity
attacks. To achieve this, a noise covariance matrix
with the corner-wave property is used.
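
As a hedged illustration of why such a covariance structure helps, the sketch below builds a covariance matrix whose (i, j) entry is the smaller of the two copies' noise variances, which is how we read the corner-wave property of [6], and compares it with independent noise. The adversary is modelled here by the best linear unbiased estimate of a value x from copies y_i = x + z_i, whose error variance is 1 / (1^T K^{-1} 1); this estimator, the example variances, and the function names are assumptions made for illustration.

```python
def corner_wave_cov(variances):
    """Covariance whose (i, j) entry is the smaller of the two noise variances."""
    return [[min(vi, vj) for vj in variances] for vi in variances]

def independent_cov(variances):
    """Diagonal covariance: every copy gets its own independent noise."""
    n = len(variances)
    return [[variances[i] if i == j else 0.0 for j in range(n)] for i in range(n)]

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def best_linear_error_variance(cov):
    """Error variance of the best linear unbiased estimate from all copies."""
    ones = [1.0] * len(cov)
    return 1.0 / sum(solve(cov, ones))

variances = [4.0, 9.0, 25.0]  # hypothetical noise variances, most trusted first
print(best_linear_error_variance(independent_cov(variances)))  # below 4.0
print(best_linear_error_variance(corner_wave_cov(variances)))  # equals 4.0
```

Under these assumptions, combining independently perturbed copies gives an error variance below that of the least perturbed copy, whereas with the correlated noise the combined estimate is no better than the least perturbed copy alone.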

Batch Generation
When the data owner knows the number of data
miners and their trust levels in advance, he generates
the corresponding number of perturbed copies with
the privacy goal in mind. This process is known as
batch generation. Two algorithms are proposed for
batch noise generation, one parallel and one
sequential. The parallel generation algorithm is
shown in Fig. 1.



Fig. 1 Parallel generation algorithm

This algorithm generates all noise components
simultaneously. Algorithm 2, presented in Fig. 2,
generates the noise sequentially.
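
Parallel generation can be sketched as follows. This is an illustrative sketch, not the algorithm of Fig. 1: it assumes the noise for all M copies of a record is drawn jointly from a zero-mean Gaussian whose covariance has the min-variance (corner-wave) structure, using a hand-written Cholesky factor so that only the Python standard library is needed. The variances must be distinct and positive for the factorization to exist; all names and values are hypothetical.

```python
import math
import random

def corner_wave_cov(variances):
    """Covariance whose (i, j) entry is the smaller of the two noise variances."""
    return [[min(vi, vj) for vj in variances] for vi in variances]

def cholesky(cov):
    """Lower-triangular factor L with L @ L.T == cov (cov positive definite)."""
    n = len(cov)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(cov[i][i] - partial)
            else:
                lower[i][j] = (cov[i][j] - partial) / lower[j][j]
    return lower

def perturb_all(values, variances, rng):
    """Return M perturbed copies; copy i carries noise of variance variances[i]."""
    lower = cholesky(corner_wave_cov(variances))  # factor once, reuse per record
    m = len(variances)
    copies = [[] for _ in range(m)]
    for x in values:
        white = [rng.gauss(0.0, 1.0) for _ in range(m)]  # independent N(0, 1)
        for i in range(m):
            # Correlate the white noise: z = L @ white has covariance L @ L.T.
            z_i = sum(lower[i][k] * white[k] for k in range(i + 1))
            copies[i].append(x + z_i)
    return copies

rng = random.Random(11)
ages = [23.0, 41.0, 35.0, 67.0]  # hypothetical records
copies = perturb_all(ages, [4.0, 9.0, 25.0], rng)
print(len(copies), len(copies[0]))
```

Because all M noise components of a record are produced in one step, the full covariance matrix must be known and held in memory before any copy is generated.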


Fig. 2 Sequential generation of noise

This algorithm generates the noise components
sequentially and is a memory efficient solution. The
drawback of batch generation is that the data owner
must know a priori all the trust levels required by all
data miners, so it is not flexible. For on-demand
requirements, another algorithm is proposed, as
presented in Fig. 3.
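
A sequential generator can be sketched in a few lines. This is an assumption-laden illustration rather than the algorithm of Fig. 2: it assumes the noise variances are sorted in ascending order and builds each noise value from the previous one by adding an independent Gaussian increment whose variance is the difference between consecutive variances. The function name and the example variances are hypothetical.

```python
import math
import random

def sequential_noise(variances, rng):
    """Noise values for one record, one per copy, built one after another.

    variances must be sorted in ascending order. Each value is the previous
    value plus an independent increment, so only the previous value and its
    variance need to be kept in memory.
    """
    noise = []
    previous_value, previous_var = 0.0, 0.0
    for var in variances:
        increment = rng.gauss(0.0, math.sqrt(var - previous_var))
        previous_value += increment
        previous_var = var
        noise.append(previous_value)
    return noise

rng = random.Random(5)
print(sequential_noise([4.0, 9.0, 25.0], rng))
```

Under this construction the covariance between the noise of two copies equals the smaller of their two variances, the same structure as in the parallel case, without forming the covariance matrix.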

Fig. 3 On-demand noise generation algorithm

This algorithm gives the data owner flexibility
because it can generate perturbed copies of the data
on demand, while still achieving the specified
privacy goal under multilevel trust. More technical
details of multilevel trust based PPDM can be found
in [29].
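
One way to picture on-demand generation is sketched below. It is not the algorithm of Fig. 3: it assumes the same min-variance covariance structure and draws the noise for a newly requested variance from its Gaussian distribution conditioned on the noise already issued. Under that structure only the nearest issued variances below and above the new one matter. The class and method names are hypothetical.

```python
import math
import random

class OnDemandNoise:
    """Noise issued so far for one record, keyed by noise variance."""

    def __init__(self, rng):
        self.rng = rng
        self.issued = {}  # noise variance -> noise value already released

    def noise_for(self, variance):
        """Return noise of the given variance, consistent with earlier releases."""
        if variance in self.issued:
            return self.issued[variance]
        lower = max((v for v in self.issued if v < variance), default=None)
        upper = min((v for v in self.issued if v > variance), default=None)
        # With nothing issued below, anchor at variance 0 with noise value 0.
        low_var = lower if lower is not None else 0.0
        low_val = self.issued[lower] if lower is not None else 0.0
        if upper is None:
            # Beyond every issued level: add an independent increment.
            mean = low_val
            cond_var = variance - low_var
        else:
            # Between two levels: interpolate and add the conditional spread.
            weight = (variance - low_var) / (upper - low_var)
            mean = low_val + weight * (self.issued[upper] - low_val)
            cond_var = (variance - low_var) * (upper - variance) / (upper - low_var)
        value = self.rng.gauss(mean, math.sqrt(cond_var))
        self.issued[variance] = value
        return value

generator = OnDemandNoise(random.Random(3))
for requested in (9.0, 25.0, 4.0):  # trust levels arrive in arbitrary order
    print(requested, generator.noise_for(requested))
```

The design choice here is that the owner stores the noise already released for each record, so a later request, even for a more trusted miner, can be answered without regenerating earlier copies.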

IV. EXPERIMENTAL RESULTS
In the first set of experiments, Algorithm 3 is used
because it gives flexibility to the data owner. The
experiments consider an attacker who holds multiple
perturbed copies, which is the best case for the
attack. To create this scenario, we assume that all
perturbed copies are accessible to the data miners.
Because the perturbed copies are released one by
one, the number of copies available to an attacker
grows with each release. Experiments are run on two
sensitive columns, Age and Income. The results are
compared with those of the independent noise
scheme (IN).


[Plot: Normalized Estimation Error (vertical axis, 0 to 0.6) against Number of Perturbed Copies (horizontal axis, 1 to 40). Legend: IN KX=100, OURS KX=100, IN KX=200, OURS KX=200, IN KX=300, OURS KX=300, IN with perfect knowledge, OURS with perfect knowledge, OUR least perturbed.]

Fig. 4. Comparison of the average normalized
estimation error of the independent noise scheme
(denoted as IN) and our scheme (denoted as Ours) on
the Age data

In Fig. 4, the horizontal axis represents the number
of perturbed copies and the vertical axis represents
the normalized estimation error.


Fig. 5. Comparison of the average normalized
estimation error of the independent noise scheme
(denoted as IN) and our scheme (denoted as Ours) on
the Income data

In Fig. 5, the horizontal axis represents the number
of perturbed copies and the vertical axis represents
the normalized estimation error.


Fig. 6. The corresponding histogram of the estimation
error when M = 5

In Fig. 6, the horizontal axis represents X and the
vertical axis represents Proportion.

Fig. 7. The cumulative histogram of the estimation
error when M = 5

In Fig. 7, the horizontal axis represents X and the
vertical axis represents Proportion.

Fig. 8. The corresponding histogram of the estimation
error when M = 10

In Fig. 8, the horizontal axis represents X and the
vertical axis represents Proportion.
[Plot: Normalized Estimation Error (vertical axis, 0 to 0.6) against Number of perturbed copies (horizontal axis). Legend: IN KX=100, OURS KX=200, IN KX=200, OURS KX=200, IN KX=300, OURS KX=300, IN with perfect knowledge.]
[Plot: Proportion (vertical axis, 0 to 1.2) against X (horizontal axis, 10 to 50). Legend: Our Scheme, Independent Noise Scheme.]
[Plot: Proportion (vertical axis, 0 to 0.16) against X (horizontal axis, 10 to 50). Legend: Our Scheme, Independent Noise Scheme.]
[Plot: Proportion (vertical axis, 0 to 1.2) against X (horizontal axis, 10 to 50). Legend: Our Scheme, Independent Noise Scheme.]


Fig. 9. The cumulative histogram of the estimation
error when M = 10

In Fig. 9, the horizontal axis represents X and the
vertical axis represents Proportion.

Fig. 10. The corresponding histogram of the
estimation error when M = 20

In Fig. 10, the horizontal axis represents X and the
vertical axis represents Proportion.
V. CONCLUSION
In this paper we presented an approach to privacy
preserving data mining. Existing perturbation based
PPDM models assume a single level of trust in data
miners. In this work we focus on multilevel trust
based PPDM, which gives the data owner more
flexibility in choosing the level of privacy for the
data. A challenge in doing so is that malicious data
miners can combine multiple copies of the perturbed
data in order to reconstruct the original data. This
kind of attack is prevented by using a noise
correlation matrix across the copies, so that the
diversity of the copies gives attackers no advantage.
We built a prototype application that demonstrates
the proof of concept of multilevel trust based PPDM.
The empirical results indicate that the solution is
robust and effective.

REFERENCES
[1] S. Papadimitriou, F. Li, G. Kollios, and P.S. Yu, "Time Series Compressibility and Privacy," Proc. 33rd Int'l Conf. Very Large Data Bases (VLDB '07), 2007.
[2] K. Liu, H. Kargupta, and J. Ryan, "Random Projection-Based Multiplicative Data Perturbation for Privacy Preserving Distributed Data Mining," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 1, pp. 92-106, Jan. 2006.
[3] F. Li, J. Sun, S. Papadimitriou, G. Mihaila, and I. Stanoi, "Hiding in the Crowd: Privacy Preservation on Evolving Streams Through Correlation Tracking," Proc. IEEE 23rd Int'l Conf. Data Eng. (ICDE), 2007.
[4] Z. Huang, W. Du, and B. Chen, "Deriving Private Information From Randomized Data," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD), 2005.
[5] D. Agrawal and C.C. Aggarwal, "On the Design and Quantification of Privacy Preserving Data Mining Algorithms," Proc. 20th ACM SIGMOD-SIGACT-SIGART Symp. Principles of Database Systems (PODS '01), pp. 247-255, May 2001.
[6] Y. Li, M. Chen, Q. Li, and W. Zhang, "Enabling Multilevel Trust in Privacy Preserving Data Mining."
[7] Y. Lindell and B. Pinkas, "Privacy Preserving Data Mining," Proc. Int'l Cryptology Conf. (CRYPTO), 2000.
[8] R. Agrawal and R. Srikant, "Privacy Preserving Data Mining," Proc. ACM SIGMOD Int'l Conf. Management of Data (SIGMOD '00), 2000.
[9] J. Vaidya and C.W. Clifton, "Privacy Preserving Association Rule Mining in Vertically Partitioned Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2002.
[10] O. Goldreich, "Secure Multi-Party Computation," Final (incomplete) draft, version 1.4, 2002.
[11] J. Vaidya and C. Clifton, "Privacy-Preserving K-Means Clustering over Vertically Partitioned Data," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2003.
[12] A.W.-C. Fu, R.C.-W. Wong, and K. Wang, "Privacy-Preserving Frequent Pattern Mining across Private Databases," Proc. IEEE Fifth Int'l Conf. Data Mining, 2005.
[13] C.C. Aggarwal and P.S. Yu, "A Condensation Approach to Privacy Preserving Data Mining," Proc. Int'l Conf. Extending Database Technology (EDBT), 2004.
[Plot: Cumulative Proportion (vertical axis, 0 to 0.16) against X (horizontal axis, 10 to 50). Legend: Our Scheme, Independent Noise Scheme.]
[Plot: Proportion (vertical axis, 0 to 1.2) against X (horizontal axis, 10 to 50). Legend: Our Scheme, Independent Noise Scheme.]

[14] L. Sweeney, "K-Anonymity: A Model for Protecting Privacy," Int'l J. Uncertainty, Fuzziness and Knowledge-Based Systems (IJUFKS), vol. 10, pp. 557-570, 2002.
[15] A. Machanavajjhala, J. Gehrke, D. Kifer, and M. Venkitasubramaniam, "L-Diversity: Privacy Beyond K-Anonymity," Proc. Int'l Conf. Data Eng., 2006.
[16] D. Kifer and J.E. Gehrke, "Injecting Utility Into Anonymized Datasets," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2006.
[17] E. Bertino, B.C. Ooi, Y. Yang, and R.H. Deng, "Privacy and Ownership Preserving of Outsourced Medical Data," Proc. 21st Int'l Conf. Data Eng. (ICDE), 2005.
[18] R. Agrawal, R. Srikant, and D. Thomas, "Privacy Preserving OLAP," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2005.
[19] W. Du and Z. Zhan, "Using Randomized Response Techniques for Privacy-Preserving Data Mining," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2003.
[20] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, "Privacy Preserving Mining of Association Rules," Proc. ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining, 2002.
[21] H. Kargupta, S. Datta, Q. Wang, and K. Sivakumar, "On the Privacy Preserving Properties of Random Data Perturbation Techniques," Proc. IEEE Third Int'l Conf. Data Mining, 2003.
[22] K. Chen and L. Liu, "Privacy Preserving Data Classification with Rotation Perturbation," Proc. IEEE Fifth Int'l Conf. Data Mining, 2005.
[23] R. Agrawal, A. Evfimievski, and R. Srikant, "Information Sharing across Private Databases," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2003.
[24] R. Agrawal, D. Asonov, M. Kantarcioglu, and Y. Li, "Sovereign Joins," Proc. 22nd Int'l Conf. Data Eng. (ICDE '06), 2006.
[25] L. Kissner and D. Song, "Privacy-Preserving Set Operations," Proc. Int'l Cryptology Conf. (CRYPTO), 2005.
[26] J. Byun, Y. Sohn, E. Bertino, and N. Li, "Secure Anonymization for Incremental Datasets," Proc. Third VLDB Workshop Secure Data Management, 2006.
[27] X. Xiao and Y. Tao, "M-Invariance: Towards Privacy Preserving Re-Publication of Dynamic Datasets," Proc. ACM SIGMOD Int'l Conf. Management of Data, 2007.
[28] G. Wang, Z. Zhu, W. Du, and Z. Teng, "Inference Analysis in Privacy-Preserving Data Re-Publishing," Proc. Int'l Conf. Data Mining, 2008.
[29] D. Knuth, The Art of Computer Programming: Seminumerical Algorithms, vol. 2, ch. 3, Addison-Wesley, 1981.
AUTHORS
Sailaja Returi is pursuing an M.Tech (CSE) at
MLRIT, Hyderabad, AP, India. She received her
B.Tech degree in Computer Science and
Engineering. Her main research interests include
data mining and networking.

P. Dayaker is currently with the Department of
Computer Science and Engineering, MLRIT, Andhra
Pradesh, India. He has 7 years of teaching experience.
His main research interests include cloud computing and
data mining.
