
2013 International Conference on Circuits, Power and Computing Technologies [ICCPCT-2013]

A New Model for Privacy Preserving Multiparty Collaborative Data Mining
S. Bhanumathi
Research Scholar
Department of Computer Science and Engineering
Sathyabama University
Chennai-600119, India
banujun8@gmail.com
Abstract- With the growing use of the Internet, the privacy of
sensitive data in multiparty collaborative mining has become a major
issue. In multiparty collaborative mining, a group of participants
contribute their own data sets and collaborate to find a quality
model. Each participant holds both sensitive and non-sensitive data
in its local database. An important challenge of privacy preserving
collaborative data mining (PPCDM) is therefore how multiple parties
can efficiently conduct data mining without exposing each
participant's sensitive information. This paper proposes a new
Binary Integer Programming model for multiparty collaborative
data mining, which addresses the problem of disclosure of
sensitive data. In addition, the confidentiality of the newly created
pooled data is maintained by the semantically secure ElGamal
encryption scheme. Finally, an Artificial Neural Network is used by
the service provider to predict patterns that help data providers
identify the risk factors of colorectal cancer.
Keywords- Collaborative Data Mining, Privacy Preservation,
Binary Integer Programming Model, Artificial Neural Network,
ElGamal Encryption Scheme.
I. INTRODUCTION
In privacy preserving data mining, there is a need to extract
knowledge from databases without disclosing information
about specific individuals [7], [16], [17]. This is particularly true
when data is shared across organizational boundaries [4]. When
several users are involved in data mining, they would all like to
transfer their data to a trusted common centre to conduct the
mining; however, it is very difficult for a user to trust the other
users. In this situation, the privacy of user data is a great concern.
This setting is called Privacy Preserving Collaborative Data Mining.
For example, cancer research institutes in different geographical
areas may need to collaboratively discover the ecological factors
correlated with a certain type of cancer.
P. Sakthivel
Associate Professor
Department of Electronics and Communication Engineering
Anna University
Chennai-600025, India
psv@annauniv.edu
978-1-4673-4922-2/13/$31.00 ©2013 IEEE
The gap between data mining and data confidentiality is
filled by privacy preserving data mining [4]. Recent developments
in hardware technology, e-commerce and the Internet have
accelerated research on privacy-preserving collaborative data
mining, and the problem has attracted several researchers. Many
approaches have been applied to this kind of problem, such as
secure multiparty computation [1], randomization techniques [18]
and geometric perturbation [5]. However, secure multiparty
computation is not scalable and incurs high computation and
communication cost [22]. According to [5], [21], geometric
perturbation is vulnerable to three categories of attacks: naive
estimation attacks, distance inference attacks and independent
component attacks.
The data miner may be able to produce a more accurate
reconstruction of the original data by utilizing diversity across
differently perturbed copies. This kind of attack is called a
diversity attack [24]. Menon et al. [20] discussed an integer
programming optimization algorithm for hiding sensitive
itemsets while reducing the number of modified transactions. In
[20], an integer programming approach is used to hide
sensitive frequent itemsets instead of hiding the produced rules
directly. Many popular data mining algorithms (classification
[5], association rule mining [3], [4] and clustering [6]) have
been used by data miners to predict new models. In this paper,
data mining based on the artificial neural network is studied
in detail.
The rest of the paper is organized as follows. Section II reviews
preliminaries for a better understanding of the paper. Section III
briefly reviews the collaborative framework. Section IV presents
the mathematical model for privacy preservation. Section V
discusses the ElGamal algorithm. Section VI presents the
methodology for the Artificial Neural Network. Finally, Section VII
concludes the paper.
II. PRELIMINARIES
A. Privacy Preserving Collaborative Data Mining
Multiple parties need to send their data to a trusted central
place (such as a super-computing center) to conduct the mining
process using existing data mining algorithms. However, because
of privacy concerns, the parties may not trust anybody.
This type of problem is called the Privacy Preserving
Collaborative Data Mining (PPCDM) problem. Success in
business is no longer the result of individuals toiling in isolation;
rather, it depends on team effort, partnership and
collaboration. Sometimes such collaboration becomes
important because of the mutual benefits it brings [2]. For this
kind of collaboration, data privacy becomes extremely
important.
Privacy Preserving Collaborative Data Mining is a data
mining effort that is distributed among several collaborating
agents, which may be human or software. As several agents are
involved, sending data without compromising user privacy is a
great concern. The techniques for performing privacy-preserving
data mining are drawn from information hiding, data mining and
cryptography. Collaboration is an activity conducted by two or
more parties to accomplish a common goal.
Collaborations are seen in many areas, such as educational
organizations, health care institutions and detective agencies,
whereby one party hires another party's services to achieve a
particular task. For example, law enforcement agencies
share information with each other to identify and trace
potential criminals, and health related agencies exchange
patient clinical records to provide better health care
facilities. However, these collaborations are typically narrow
in scope due to the privacy and security concerns of the
individual organizations. In this paper, techniques from data
hiding, data mining and cryptography are all applied to generate
a new model with data protection.
B. Binary Integer Programming Model
In linear programming solutions, variables may take fractional
values. Such fractional solutions are not realistic in many cases. So
the optimization problem must be considered as
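$$\text{Maximize (or minimize)} \quad Z = \sum_{j=1}^{n} c_j x_j$$
subject to
$$\sum_{j=1}^{n} a_{ij} x_j \le b_i, \quad i = 1, 2, \ldots, m,$$
where $x_j \in \{0, 1\}$ are binary variables $(j = 1, 2, \ldots, n)$ and the coefficient symbols $c_j$, $a_{ij}$ and $b_i$ follow the usual textbook notation.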
This problem is called a binary integer programming
problem. When all the decision variables must be integers,
it is called pure integer programming. When some, but
not all, variables are restricted to be integers, it is called a
mixed integer programming model. Binary integer
programming has played an important role in
supporting managerial decisions under various constraints.
Binary integer programming models have a variety of
applications, including capital budgeting, facility
location, the 0-1 knapsack problem, airline crew scheduling and
warehouse location. Binary variables are of great importance
because they occur regularly in many model formulations,
particularly in problems addressing long-range and high-cost
strategic decisions associated with capital-investment
planning.
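As a minimal illustration of a binary integer program, the 0-1 knapsack problem mentioned above can be solved by enumerating every binary assignment. The item values, weights and capacity below are made-up numbers, and the function name `solve_knapsack_bip` is hypothetical; exhaustive search is used only because the instance is tiny, so this is a sketch rather than a practical solver.

```python
from itertools import product

def solve_knapsack_bip(values, weights, capacity):
    """Maximize sum(values[j] * x[j]) subject to
    sum(weights[j] * x[j]) <= capacity, with every x[j] in {0, 1},
    by exhaustive enumeration of all binary assignments."""
    best_value, best_x = 0, tuple(0 for _ in values)
    for x in product((0, 1), repeat=len(values)):
        weight = sum(w * xj for w, xj in zip(weights, x))
        if weight > capacity:
            continue  # this assignment violates the constraint
        value = sum(v * xj for v, xj in zip(values, x))
        if value > best_value:
            best_value, best_x = value, x
    return best_value, best_x

# Hypothetical instance: three candidate projects and a budget of 50.
best_value, best_x = solve_knapsack_bip([60, 100, 120], [10, 20, 30], 50)
print(best_value, best_x)
```

Each `x[j]` plays the role of a binary decision variable $x_j$: it is 1 when project j is selected and 0 otherwise.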
C. Semantic Security
Semantic security of an encryption system [3] can be defined
as follows:
Definition: An encryption system (G, E, D) is said to be
semantically secure if, for every probability distribution g,
every polynomial function h, every semantic function f
and every probabilistic polynomial time algorithm A, there
exists a probabilistic polynomial time algorithm A' such that
for every constant c > 0 and sufficiently large n (the size of
messages),
$$\Pr\left[A\left(E_{G(1^n)}(m), h(m), r\right) = f(m)\right] - \Pr\left[A'\left(h(m), r\right) = f(m)\right] < \frac{1}{n^c}$$
where the probability $\Pr[A(E_{G(1^n)}(m), h(m), r) = f(m)]$ is
taken over the coin tosses of the algorithms A, E and G and
the probability distribution of messages (g), and the probability
$\Pr[A'(h(m), r) = f(m)]$ is taken over the coin tosses of A' and
the probability distribution of messages (g).
Intuitively, the notion of semantic security [3], [15] says that
anything that can be efficiently computed from the ciphertext
together with additional partial information on the plaintext can
also be efficiently computed given only the length of the plaintext
and that partial information. Informally, a ciphertext does not
reveal any useful information about the plaintext, other than its
length, to a polynomial-time attacker. This security notion has
become a standard requirement in the design of new cryptosystems.
D. Artificial Neural Network (ANN)
An Artificial Neural Network (ANN) is usually simply called a
Neural Network (NN). An ANN has the characteristics of distributed
information storage, self-organizing learning and parallel
processing. It can find solutions to several problems that are
difficult to solve by other methods. A neural network
comprises three parts [8]: the model, the learning
algorithm and the activation function.
Its model can be classified into three types: feedback
networks, self-organizing networks and feed-forward
networks. The NN method is used for clustering,
classification, pattern recognition and prediction in data
mining. It is an efficient approach for developing a
mathematical or computational model.
Fig. 1. Artificial Neural Network
A neural network consists of an interconnected group of
nodes (artificial neurons) and processes information
using a connectionist approach to computation. In many
cases an ANN is an adaptive system that changes its structure
based on external or internal information that flows through
the network during the learning phase. Modern neural
networks are non-linear statistical data modelling tools. They
are usually used to model complex relationships between
inputs and outputs, as shown in Fig. 1. An ANN has a remarkable
ability to derive meaning from complicated data and can also
extract patterns.
III. FRAMEWORK FOR MULTIPARTY COLLABORATIVE DATA MINING
Multiple parties are interested in using data mining services
to find universal models. In collaborative mining, the
private data of each participating data provider must be
preserved and the quality of the mined models must also be
maintained. The most popular solution for data distribution in
collaborative data mining is a service-oriented infrastructure.
Two kinds of parties are involved in the computation: the
data providers, who distribute their data, and the service
provider, who has more computing power and provides services
to the data providers. The participants (data providers) share their
local information among themselves without disclosing their
sensitive data to each other. To reach this goal, the Binary Integer
Programming model is used. The data mining technique is
run by the service provider (the data miner) to mine the
models of interest. The service-oriented framework is shown in
Fig. 2.
Fig. 2. Proposed Collaborative Framework for PPCDM
In the information security model, threats may come from
two kinds of attackers: internal attackers and external
attackers. Here, the collaborating parties are regarded as internal
attackers and network attackers as external
attackers [3]. In this paper, the Binary Integer Programming
model is applied to each participant's data to prevent
identification of sensitive data. Cryptographic techniques are used
to protect data from external network attackers.
IV. PROPOSED MATHEMATICAL MODEL FOR
COLLABORATIVE DATA MINING
Our proposed method attempts to hide selected sensitive
data in a data provider's database from the other data providers.
The modified version of the database, DB', will have the same size
as the original database DB; therefore |DB| = |DB'|. Our target
for hiding sensitive attribute values can be represented as:
For all $A_i \subseteq A$,
where $A$ is the set of attributes and $A_i$ is the set of sensitive attributes.
Consider a set of variables $\{T_1, T_2, \ldots, T_n\}$ and a set of
constraints $\{C_1, C_2, \ldots, C_n\}$, where each variable $T_i$ has a
non-empty domain E. An assignment that does not violate the set
of constraints is said to be consistent. The problem can be
solved in a linear or non-linear manner. Our formulation
enables us to solve the sanitization problem in E and is
capable of identifying the solution. The model is
[Equations (1) and (2), giving the objective function and the constraints over the binary values $U_{ij}$, are not legible in the source.]
Here $U_{ij}$ denotes the bitmap value. These $U_{ij}$ values need to
be adjusted so as to hide the sensitive
attributes while yielding a result with the least effect on the
rest of the database.
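The following toy sketch illustrates the general idea of adjusting binary bitmap values to hide sensitive data. It is not the model of equations (1) and (2) but a simplified stand-in in the spirit of the integer programming hiding approaches of [19], [20]. The 0/1 bitmap, the sensitive attribute combination, the threshold `max_support` and the function name `hide_itemset` are all assumptions made for illustration, and exhaustive search replaces a real integer programming solver.

```python
from itertools import product

def hide_itemset(bitmap, sensitive, max_support):
    """Flip as few 1-bits to 0 as possible so that at most `max_support`
    rows still contain every attribute (column index) in `sensitive`.
    Returns (number_of_flips, sanitized_bitmap)."""
    # Only bits in supporting rows and sensitive columns can matter.
    cells = [(i, j) for i, row in enumerate(bitmap)
             if all(row[c] for c in sensitive) for j in sensitive]
    best = None
    for u in product((1, 0), repeat=len(cells)):  # u = new value of each cell
        candidate = [list(row) for row in bitmap]  # same size as the original
        for (i, j), bit in zip(cells, u):
            candidate[i][j] = bit
        support = sum(all(row[c] for c in sensitive) for row in candidate)
        flips = u.count(0)
        if support <= max_support and (best is None or flips < best[0]):
            best = (flips, candidate)
    return best

# Hypothetical database: 4 records over 3 binary attributes.
bitmap = [[1, 1, 0], [1, 1, 1], [0, 1, 1], [1, 1, 0]]
flips, sanitized = hide_itemset(bitmap, sensitive=(0, 1), max_support=1)
print(flips, sanitized)
```

The sanitized bitmap keeps the same number of records as the original, matching the requirement |DB| = |DB'|, and only bits belonging to the sensitive combination are changed.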
V. SECURED ELGAMAL ENCRYPTION SCHEME
The semantically secure version of the ElGamal
cryptosystem [3], [12] allows communicating parties to
exchange encrypted messages over insecure networks. This
encryption system contains three components [10], [11]: the key
generation algorithm, the encryption algorithm and the decryption
algorithm. The security of the ElGamal cryptosystem rests on the
presumed intractability of the discrete logarithm problem and the
Diffie-Hellman problem, which underlie the security of many
public key cryptosystems. Here, the data providers and the service
provider send and receive secured messages using the ElGamal
encryption scheme over an insecure network.
1. Key Generation:
The key generator works as follows:
1. Find a random prime number Q.
2. Check whether $P = 2Q + 1$ is prime; if P is not prime, go to step 1.
3. Select a random generator h and set $g = h^2 \bmod P$.
4. Let desc(G) be such that $G = \langle g \rangle$.
5. Let G be the plaintext message space.
6. Choose a random number x from {1, 2, ..., P-1} as the
private key.
7. Calculate the public key as $y = g^x \pmod{P}$.
8. Publish (G, P, g, y) as the public key; the private key x
must be kept secret.
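The key generation steps above can be sketched in Python as follows. This is a toy illustration only: the 16-bit size, the trial-division `is_prime` helper and the function name `generate_keys` are assumptions made for readability, and the private key is drawn from {1, ..., Q-1} (a subset of the range in step 6) because g has order Q. Real deployments rely on vetted cryptographic libraries and far larger primes.

```python
import random

def is_prime(n):
    """Trial-division primality test; adequate only for toy-sized numbers."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True

def generate_keys(bits=16, rng=random):
    """Return ((P, Q, g, y), x) for a toy ElGamal instance over the
    order-Q subgroup of squares modulo the safe prime P = 2Q + 1."""
    while True:
        Q = rng.getrandbits(bits) | 1            # step 1: random odd candidate
        if is_prime(Q) and is_prime(2 * Q + 1):  # step 2: P = 2Q + 1 must be prime
            break
    P = 2 * Q + 1
    h = rng.randrange(2, P - 1)  # step 3: h is never 1 or P - 1, so g != 1
    g = pow(h, 2, P)             # g = h^2 mod P generates the subgroup G
    x = rng.randrange(1, Q)      # step 6: private key (assumed range 1..Q-1)
    y = pow(g, x, P)             # step 7: public key y = g^x mod P
    return (P, Q, g, y), x

public_key, private_key = generate_keys()
print(public_key)
```

Squaring h forces g into the subgroup of order Q, which is why step 3 uses $h^2$ rather than h itself.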
2. Encryption:
The encryption algorithm (E) works as follows: the sender
encrypts a message m to the receiver under the receiver's public key
(G, P, g, y).
1. The sender converts m (the plaintext) into an element of G.
2. Select a random number k from {1, 2, ..., P-1}.
3. Calculate the ciphertext $(y_1, y_2)$ as follows:
$$E((G, P, g, y), m) = \left(g^k \bmod P,\; y^k m \bmod P\right)$$
i.e. $y_1 = g^k \bmod P$ and $y_2 = y^k m \bmod P$.
4. The sender sends the ciphertext $(y_1, y_2)$ to the receiver.
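3. Decryption:
The decryption algorithm (D) works as follows: the receiver
decrypts the ciphertext $(y_1, y_2)$ with the private key x:
$$D(x, (y_1, y_2)) = \frac{y_2}{y_1^{x}} \pmod{P}$$
The receiver's decryption procedure returns the same
plaintext message that the sender encrypted, since
$$\frac{y_2}{y_1^{x}} \equiv \frac{y^k m}{(g^k)^x} \equiv \frac{g^{xk} m}{g^{xk}} \equiv m \pmod{P}.$$
The decryption calculation restores the plaintext m.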
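A round trip through the encryption and decryption steps can be sketched as below. The parameters (P = 23, g = 4, private key x = 6, message m = 9) are hypothetical toy values chosen so the arithmetic can be checked by hand, and the function names `encrypt` and `decrypt` are illustrative; nothing here is secure at this size.

```python
import random

def encrypt(public_key, m, rng=random):
    """Return the ciphertext (y1, y2) = (g^k mod P, y^k * m mod P)."""
    P, g, y = public_key
    k = rng.randrange(1, P)  # fresh random k from {1, ..., P-1} per message
    return pow(g, k, P), (pow(y, k, P) * m) % P

def decrypt(P, x, ciphertext):
    """Recover m = y2 / y1^x mod P using the private key x."""
    y1, y2 = ciphertext
    s = pow(y1, x, P)  # s = g^(k*x) mod P, which equals y^k mod P
    # Division is multiplication by the inverse; s^(P-2) = s^(-1) mod P (Fermat).
    return (y2 * pow(s, P - 2, P)) % P

# Hypothetical toy parameters: P = 2*11 + 1, g = 2^2 mod 23, private key x = 6.
P, g, x = 23, 4, 6
y = pow(g, x, P)
m = 9  # 9 = 3^2 mod 23 is an element of G = <g>
c1 = encrypt((P, g, y), m)
c2 = encrypt((P, g, y), m)
print(c1, c2, decrypt(P, x, c1), decrypt(P, x, c2))
```

Because k is chosen afresh for every message, the same plaintext usually yields different ciphertexts, which is the property that semantic security relies on.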
VI. MODEL PREDICTION BY ANN
A. Data Mining Process Based on Neural Network
Data mining based on neural networks is composed of three
phases [9]: data preparation, rule extraction and rule
assessment, as shown in Fig. 3. The data preparation phase includes
data cleansing, data selection, data preprocessing and data
expression. Rule extraction is a method of extracting rules
from a recursive network. The rules can then be assessed with
several objectives: finding the optimal sequence of
extracted rules, testing the accuracy of the extracted rules,
detecting how much knowledge in the neural network has not been
extracted, and detecting inconsistency between the extracted
rules and the trained neural network.
Fig. 3. Data Mining Process based on Neural Network
B. Artificial Neural Network in Colorectal Cancer Prediction
ANN is a branch of computational intelligence that
employs a variety of optimization tools to learn
from past experience and uses this prior training to
predict and recognize new patterns. In [14], an Artificial Neural
Network is used for the diagnosis of thyroid disorders. Here,
neural network models are used for the prediction of
risk factors in colorectal cancer. The colon consists of four sections:
the transverse colon, ascending colon, descending colon and
sigmoid colon. Colorectal cancer is one of the most common
cancers; it develops in the colon or the rectum. It is the third
leading cause of cancer death.
TABLE I. GENERAL CHARACTERISTICS OF THE PATIENTS WITH COLORECTAL CANCER
[Table body not recoverable from the source. Legible row groups include sex (male, female), ethnicity, type of first treatment and pathological stage.]
In [13], the general characteristics of patients with colorectal
cancer are listed; they are shown in Table I.
An Artificial Neural Network is used to simulate human brain
functions. It is composed of parallel computing units called
neurons. These neurons can be linked in several ways to form
different neural network architectures. The most popular
architecture is the Multi-Layer Perceptron (MLP). It contains
two or more layers of neurons, joined in a
sequential manner. Each neuron is connected to
neurons in a different layer by weighted pathways.
Signals are passed through these pathways to the other
neurons. Each neuron sums the weighted signals and passes
the sum through an activation function to produce its output.
The output signal is then transmitted to the
neurons in the subsequent layer. The input layer receives
signals from the data entering the network. The output
layer delivers the outcome to the outside world. The
complete methodology [23] is shown in Fig. 4.
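The forward pass described above, where each neuron sums its weighted inputs and applies an activation function, can be sketched as follows. The network shape, the weights, the bias terms and the function name `forward` are all assumptions for illustration (the text does not mention biases or give weights); the hyperbolic tangent is used as a common choice of activation.

```python
import math

def forward(inputs, layers, activation=math.tanh):
    """Propagate `inputs` through a list of layers.
    Each layer is a list of neurons; each neuron is (weights, bias)."""
    signal = inputs
    for layer in layers:
        # Every neuron sums its weighted input signals, adds its bias,
        # and sends the activated result to the next layer.
        signal = [
            activation(sum(w * s for w, s in zip(weights, signal)) + bias)
            for weights, bias in layer
        ]
    return signal

# Hypothetical network: 2 inputs -> 2 hidden neurons -> 1 output neuron.
hidden = [([0.5, -0.4], 0.1), ([0.3, 0.8], -0.2)]
output = [([1.0, -1.0], 0.0)]
print(forward([1.0, 0.0], [hidden, output]))
```

Training, which adjusts the weights so that the outputs match known outcomes, is deliberately left out of this sketch.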
Collect data set for Colorectal Cancer
Data processing
Simulate ANN models
Training and Testing phase
Prediction and Diagnosis Decision
Performance Comparison of ANN models
Fig. 4. Complete Methodology of ANN
In this paper, the predictors for the ANN are smoking, red meat
consumption, obesity, physical activity, and the use of aspirin and
other medicines. The complete methodology of the ANN is
carried out by the service provider (data miner). The accuracy of
the ANN is measured by the Root Mean Square Error (RMSE).
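For n predictions $\hat{y}_i$ and observed outcomes $y_i$, the RMSE is $\sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}$. A minimal sketch of the computation is given below; the function name `rmse` and the sample scores are hypothetical and are not results from this paper.

```python
import math

def rmse(predicted, observed):
    """Root mean square error between two equal-length sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(p - o) ** 2 for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical predicted risk scores versus observed outcomes (1 = disease present).
print(rmse([0.9, 0.2, 0.7, 0.4], [1, 0, 1, 0]))
```

A lower RMSE indicates that the network's predictions lie closer to the observed outcomes, which makes it a convenient single number for comparing ANN models.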
[Figure legend: hidden layer activation function: hyperbolic tangent; output layer activation function: identity; synaptic weight > 0; synaptic weight < 0]
Fig. 5. ANN Network Diagram for colorectal cancer
VII. CONCLUSION
In this paper, we proposed a novel approach to hide
sensitive data in multiparty collaborative mining. The
proposed Binary Integer Programming model can be used by
collaborating data providers to mask sensitive information
before presenting it on a publicly accessible platform.
Furthermore, the semantically secure ElGamal encryption
scheme can be applied to protect the confidentiality of the data
when it crosses the network from the data providers to the
service provider. Finally, this paper has discussed the
complete methodology of using an ANN to predict a model for
colorectal cancer. As future work, we plan to analyse
experimentally the accuracy and effectiveness of this approach
with respect to privacy and security.
REFERENCES
[1] Yehuda Lindell and Benny Pinkas, "Secure Multiparty Computation for Privacy-Preserving Data Mining," The Journal of Privacy and Confidentiality, Number 1, pp. 59-98, 2009.
[2] J. Breckling, Ed., "The Analysis of Directional Time Series: Applications to Wind Speed and Direction," Lecture Notes in Statistics. Berlin, Germany: Springer, Vol. 61, 1989.
[3] Justin Zhan, Stan Matwin, Nathalie Japkowicz, LiWu Chang, "Privacy-Preserving Collaborative Association Rule Mining," The Fourth International Conference on Electronic Business (ICEB2004), Beijing, 2004.
[4] Yasien, A.H., "Preserving privacy in association rule mining," Ph.D. Thesis, University of Griffith, 2007.
[5] Keke Chen and Ling Liu, "Privacy-preserving Multiparty Collaborative Mining with Geometric Data Perturbation," IEEE Transactions on Parallel and Distributed Computing, Dec 2009.
[6] J. Vaidya and C. Clifton, "Privacy preserving K-means clustering over vertically partitioned data," Proceedings ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[7] V. Verykios, E. Bertino, I. Fovino, L. Provenza, Y. Saygin and Y. Theodoridis, "State-of-the-art in privacy preserving data mining," SIGREC'04: Proceedings of the 2004 ACM SIGMOD Record, pp. 50-57, 2004.
[8] H. Lu, R. Setiono, and H. Liu, "Effective data mining using neural network," IEEE Transaction on Knowledge and Data Engineering, 1996, pp. 957-961.
[9] Xianjun Ni, "Research of data mining based on neural network," World Academy of Science, Engineering and Technology, May 2008.
[10] Justin Zhan, Stan Matwin, "A crypto-based approach to privacy preserving collaborative data mining," Sixth IEEE International Conference on Data Mining - Workshops (ICDMW'06), Dec 2006, pp. 546-550.
[11] ElGamal, T., "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Transactions on Information Theory, Vol. 31, No. 4, Jul 1985, pp. 469-472.
[12] Wenbo Mao, "Modern Cryptography: Theory and Practice," Pearson Press, 2nd edition, 2010.
[13] Akbar Biglarian, Enayatollah Bakhshi, Mahmood Reza Gohari and Reza Khodabakhshi, "Artificial Neural Network for Prediction of Distant Metastasis in Colorectal Cancer," Asian Pacific Journal of Cancer Prevention, Vol. 13, pp. 927-930, 2012.
[14] T.Z. Tan, C. Quek, G.S. Ng, E.Y.K. Ng, "A novel cognitive interpretation of breast cancer thermography with complementary learning fuzzy neural memory structure," Expert Systems with Applications, pp. 652-666, 2007.
[15] S. Goldwasser and S. Micali, "Probabilistic encryption," Journal of Computer and System Sciences, Vol. 28, pp. 270-299, 1984.
[16] C. Clifton, M. Kantarcioglu and J. Vaidya, "Defining privacy for data mining," WGDM'02: National Science Foundation Workshop on Next Generation Data Mining, pp. 126-133, 2002.
[17] Vaidya, J., Clifton, C., & Zhu, M., "Privacy Preserving Data Mining," New York: Springer, 2006.
[18] W. Du and Z. Zhan, "Using randomized response techniques for privacy-preserving data mining," in Proceedings of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, Aug. 2003, pp. 24-27.
[19] Aris Gkoulalas-Divanis and Vassilios S. Verykios, "An Integer Programming Approach for Frequent Itemset Hiding," CIKM'06, November 5-11, 2006, Arlington, Virginia, USA.
[20] S. Menon, S. Sarkar and S. Mukherjee, "Maximizing accuracy of shared databases when concealing sensitive patterns," Information Systems Research, pp. 256-270, 2005.
[21] K. Chen and L. Liu, "Towards attack-resilient geometric data perturbation," in SIAM Data Mining Conference, 2007.
[22] Durgesh Kumar Mishra, Priyanka Jangde and Gajendra Singh Chandel, "Hybrid Technique for secure sum protocol," World of Computer Science and Information Technology Journal (WCSIT), Vol. 1, No. 5, pp. 198-201, 2011.
[23] S. Bhanumathi and P. Sakthivel, "Privacy Preserving Multiparty Collaborative Mining using Integer Programming Model," Conference on Recent Trends in Computer and Networking Technologies, Dec 21-23, 2012.
[24] Yaping Li, Minghua Chen, Qiwei Li and Wei Zhang, "Enabling multilevel trust in privacy preserving data mining," IEEE Transaction on Knowledge and Data Engineering, Vol. 24, No. 9, Sep 2012.
