
International Journal of Computational Intelligence and Information Security, June 2013 Vol. 4, No. 6 ISSN: 1837-7823

Add-in Macros for Privacy-preserving Distributed Logrank Test Computation*

Yu Li and Sheng Zhong
Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY 14260, USA

Abstract

Survival analysis is frequently used to study survival outcomes in biological organisms. However, comparing survival curves step by step is a tedious process. In this study, we designed and developed a user-friendly, cloud-based, privacy-preserving Microsoft Excel program, named Scorpio, that supports collaborative research on electronic health care data using a privacy-preserving logrank test model.

Keywords: Survival curves, Logrank test, Privacy preserving, Excel

1. Introduction

In modern society, people care increasingly about their privacy as information technology develops. In hospitals, patients' medical records are stored on computers so that biomedical scientists can use them for research. These records include the patient's medical history, such as laboratory test results and prescribed medications. To prevent leaks of personal electronic health records, the federal Health Insurance Portability and Accountability Act (HIPAA) sets a national standard for protecting the privacy of this kind of information. With the rapid growth of medical research in recent years, biomedical scientists have proposed using these electronic medical data for collaborative research. However, privacy and security remain the main concerns impeding such research. For this reason, and with advances in information technology and cryptography, there is a trend toward computational methods and programs that help medical scientists collaborate without revealing patients' information to others.

Survival analysis, also called time-to-event analysis, is useful for studying many kinds of events, such as disease onset, earthquakes, and stock market crashes [1]. It makes predictions by observing a set of individuals from a specific point in time and monitoring them continuously over fixed time intervals. Building the survival analysis model is therefore the most critical step toward a good prediction. In the biomedical field, survival analysis mainly means observing the time to death of experimental subjects. The more experimental data are available for fitting, the more precise the model is likely to be.
Therefore, biomedical researchers want to combine data from different institutes to build better survival analysis models, especially models for comparing survival functions [2]. To address the privacy and security issues, computer scientists can use privacy-preserving methods that keep the data from being revealed to anyone. To compare survival curves without revealing the data, [2] proposed a privacy-preserving model. However, comparing survival curves step by step is a tedious process.

In the medical field, Microsoft Excel is widely used because of its friendly user interface and ease of operation. Other statistical packages such as SAS and SPSS offer strong data management, but they are complicated for medical staff who have not been professionally trained in them. Microsoft Excel is widely used in medical institutes both to store experimental data and to create survival curves, and it helps medical scientists analyse data and make better decisions. In addition, Excel can be controlled programmatically through VBA (Visual Basic for Applications) programs and macros. Most biomedical scientists are therefore willing to use Microsoft Excel to store their experimental data, and many scientists have developed programs that work directly and automatically within Microsoft Excel. In [3], Sato et al. presented a package of macro programs (named PK MOMENT) that automatically calculates non-compartmental pharmacokinetic parameters on a Microsoft Excel spreadsheet. In [4], Zhang et al. presented PKSolver, a freely available, menu-driven add-in program for Microsoft Excel, written in VBA, for solving basic problems in pharmacokinetic (PK) and pharmacodynamic (PD) data analysis.
In [5], Brown presented a simple, easily understood methodology for solving biologically based models using a Microsoft Excel spreadsheet. In [6], a user-friendly, inexpensive Excel-based program for finding potential phosphorylation sites in proteins is presented.

In this paper, we develop a user-friendly, cloud-based, privacy-preserving Microsoft Excel program, named Scorpio, for collaborative research on electronic health care data using a privacy-preserving logrank test model. The program does not require any programming skills or any use of VBA or macro language: once the data from all institutes are ready, it runs automatically. In the rest of this paper, we describe the method for the privacy-preserving comparison test of survival curves, in particular how the data are stored and collected, as well as the design and implementation of our program.

* This paper was supported by NSF CNS-0845149 and CCF-0915374. Part of the results were presented at [UNESST 2012]

2. Methods

The logrank test is a standard test for comparing survival curves. When a research institute wants to initiate a logrank test computation, it needs to collect data from different medical institutes. However, some medical data are very sensitive, and computing the logrank test without revealing these data to anyone other than their owner is a major issue. In [2], the authors proposed a privacy-preserving secure sum method that generates an initial random number and adds it to the first medical institute's data. Here, we briefly introduce their method. They suppose there are n groups of individuals.

Table 1: Summary of Denotations for Logrank Test

$n_{kj}$: the number of individuals that are alive in group k at the beginning of time interval j.

$d_{kj}$: the number of events occurring in group k in interval j.

$O_k$: the number of observed deaths in group k.

$E_k$: expected number of deaths in group k.
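For reference, a common textbook form of the logrank quantities can be written with the symbols of Table 1 as sketched below. This is an assumption: the pooled totals $d_j$ and $n_j$ and the chi-square form of $Z$ are the standard formulation, and the exact variant used in [2] may differ.

```latex
\begin{align}
  d_j &= \sum_{k} d_{kj}, & n_j &= \sum_{k} n_{kj}, \\
  O_k &= \sum_{j} d_{kj}, & E_k &= \sum_{j} n_{kj}\,\frac{d_j}{n_j}, \\
  Z   &= \sum_{k} \frac{(O_k - E_k)^2}{E_k}.
\end{align}
```

In this form only the pooled totals $d_j$ and $n_j$ and the sum over $k$ in $Z$ combine data from several groups, so those are the quantities a secure sum has to protect.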

The final Z is the logrank test result. A smaller Z indicates that the hypothesis is more likely to hold. In [2], the authors assume that s parties (s > 3) are involved in the logrank test computation. Their privacy-preserving method lets the first institute participating in the computation add a random number to its data; the random number should be drawn from the same range as the alive counts $n_{kj}$ and the event counts $d_{kj}$. The first institute then passes the masked values to the next participant. Each subsequent participant adds its local values to the sums it receives and sends the new sums to the next party. Finally, the first institute obtains the true sums by subtracting the random number, which only it knows, and can then calculate the logrank test. In this process, the actual values of $n_{kj}$ and $d_{kj}$ are hidden behind the random numbers [2].

Based on this privacy-preserving model, we designed a program that automatically collects the data from each participating medical institute and adds them to the initial file. After collecting all the data, the program calculates, for each interval, the number of events divided by the number of individuals alive. Each medical institute then receives this value automatically and can calculate its expected number of deaths and its logrank test statistic. The program then repeats the secure sum: another random number is added to the first institute's logrank test result, and the results of all institutes are added up. The first institute, which initiated the comparison, thus obtains the final logrank test statistic and informs all other participants.


Specifically, we use cloud-based storage to collect the data from each institute. Cloud-based storage lets everyone who has permission access the file from anywhere. As shown in Figure 1, party 1 first adds a random number to its data and uploads the file to the server; party 2 then downloads the file, adds its own data to the existing data, and uploads the file again. This continues until the last party has finished. The first party then obtains the sum of the actual data by subtracting the random number. Next, the program automatically calls the Microsoft Excel macro we developed to calculate the required values. Finally, party 1 obtains the final logrank test statistic and informs the other participating institutes.
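The upload and download round described above can be simulated in a few lines. The following is a minimal sketch in Python, not the authors' C#/VBA implementation: the function name `secure_sum`, the constant `MODULUS`, and the sample counts are hypothetical, and the sketch masks with a large random value under modular arithmetic (a common hardening) rather than the small additive offset used in the macros.

```python
import secrets

MODULUS = 2**32  # assumption: far larger than any true per-interval total


def secure_sum(local_counts):
    """Simulate one ring pass; local_counts[p][j] is party p's count for interval j."""
    intervals = len(local_counts[0])
    # Party 1 draws one secret mask per interval and uploads its masked counts.
    masks = [secrets.randbelow(MODULUS) for _ in range(intervals)]
    running = [(c + m) % MODULUS for c, m in zip(local_counts[0], masks)]
    # Each later party downloads the file, adds its own counts, and uploads it again.
    for counts in local_counts[1:]:
        running = [(r + c) % MODULUS for r, c in zip(running, counts)]
    # Party 1 removes its masks to recover the true per-interval totals.
    return [(r - m) % MODULUS for r, m in zip(running, masks)]


# Made-up death counts: four parties, three time intervals.
deaths = [[2, 3, 1], [1, 1, 2], [0, 2, 2], [1, 0, 1]]
totals = secure_sum(deaths)
```

Every intermediate file contains only masked running sums, so a party that downloads it cannot read the earlier parties' individual counts directly from it.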

3. Program Description

3.1 Software Design

The program is developed in C# combined with Microsoft Excel VBA, which is universally available and very convenient in biomedical research institutes. We assume that every medical institute uses Microsoft Excel to store its survival data. To protect the privacy of these data, our program adds a random number to the original data of the first institute. The first institute then issues the request for the logrank comparison of survival curves. Our program automatically uploads the file to the server and adds the other institutes' data to the existing data, so the institutes participating in the computation do not learn one another's survival data. Although this could be done manually, clicking through the calculation in Excel would be very tedious and time-consuming. Our program instead reads the input file and calculates the logrank survival comparison automatically, without revealing data to others.

Figure 1: The flow chart of our program


3.2 How to use Scorpio

After one institute sets up a server to store the file, each institute that wants to participate in the logrank test calculation runs our program, shown in Figure 2. First, every institute connects to the server. The medical institute that wants to initiate the calculation then uploads its file, to which a random number has already been added, chooses the participants, and clicks the send button. Each participant receives a message in turn; the program then downloads the file, adds the participant's data to the data already in the file, and uploads it again. After all participants have added their data, the first party obtains the total, still masked by the random number it added.


Figure 2: The program user interface for privacy preserving logrank test


3.3 Computation of the survival curve comparison using the logrank test

To compute the logrank comparison of survival curves, after collecting the data from all medical institutes, the program subtracts the random number that was added to the original data of the first institute and obtains the total numbers of alive individuals and deaths in every interval. The program then calculates the sums of the alive and death numbers, respectively.
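Once the pooled totals are known, the remaining arithmetic is straightforward. The sketch below, in Python rather than the paper's VBA, assumes the common chi-square form of the logrank statistic; the function name `logrank_statistic` and the sample counts are hypothetical and are not taken from the paper's data.

```python
def logrank_statistic(groups):
    """groups: one (alive_counts, death_counts) pair of per-interval lists per group."""
    intervals = range(len(groups[0][0]))
    # Pooled totals per interval (the values a secure sum would deliver).
    n_total = [sum(n[j] for n, _ in groups) for j in intervals]
    d_total = [sum(d[j] for _, d in groups) for j in intervals]
    # Per-interval ratio of events to individuals alive.
    ratio = [d_total[j] / n_total[j] if n_total[j] else 0.0 for j in intervals]
    # Observed and expected deaths for each group.
    observed = [sum(d) for _, d in groups]
    expected = [sum(n[j] * ratio[j] for j in intervals) for n, _ in groups]
    # Assumed chi-square form of the statistic: sum over groups of (O - E)^2 / E.
    z = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return observed, expected, z


# Made-up counts: two groups, three time intervals.
groups = [
    ([10, 8, 5], [2, 3, 1]),
    ([10, 9, 8], [1, 1, 2]),
]
observed, expected, z = logrank_statistic(groups)
```

A quick sanity check on any input is that the expected deaths summed over all groups equal the total number of observed deaths.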

3.4 Program Code

Here we list some of the Excel macros we developed for our program.

Add Random Number

Sub AddRandomNumber()
    Dim rndD As Long, rndN As Long

    ' Mask the death counts in column B with one random offset (column F).
    rndD = Application.WorksheetFunction.RandBetween(1, 20)
    Range("F1").Value = "Random Number for d"
    Range("F2:F12").Value = rndD
    Range("E1").Value = "dj+RND"
    Range("E2:E12").FormulaR1C1 = "=RC[-3]+RC[1]"

    ' Mask the alive counts in column C with a second random offset (column I).
    rndN = Application.WorksheetFunction.RandBetween(1, 20)
    Range("I1").Value = "Random Number for n"
    Range("I2:I12").Value = rndN
    Range("H1").Value = "nj+RNN"
    Range("H2:H12").FormulaR1C1 = "=RC[-5]+RC[1]"
End Sub

Compute Ek

Sub ComputeE()
    ' Expected deaths per interval: column C (alive counts) times column K,
    ' which is assumed to hold the pooled ratio of events to individuals alive.
    Range("M1").Value = "E"
    Range("M2:M12").FormulaR1C1 = "=RC[-10]*RC[-2]"
End Sub

4. Samples of Program Runs

Medical scientists usually prefer to store their experimental data in Microsoft Excel. They also care about privacy when they want to combine data from different medical institutes for research. Scorpio is designed specifically for medical scientists who want to combine their survival data to compare survival curves using the logrank test. The input data are shown in Figure 3: the medical scientists only need to enter the numbers of alive individuals and deaths for each time interval. After the program has collected all the required data from the other institutes, the first party uses the macro we provide to obtain the final logrank test statistic, as shown in Figure 4.


Figure 3: Original data owned by each institute, which should be kept confidential from other parties


Figure 4: The final result of privacy-preserving logrank test statistic


5. Hardware and software specifications

An Intel Core i5 computer (2 GB RAM) running the Windows 7 operating system was used. The program was developed using Microsoft Excel's macro language on the Excel 2010 platform.

6. Conclusion

In this paper, we have designed a Microsoft Excel macro-based, privacy-preserving program for comparing survival curves using the logrank test. To make it easy to use, the program works directly within Microsoft Excel, which is widely used by clinics and biomedical scientists. It protects the privacy of the data by adding a random number to the original data. Experiments on real medical data have shown the effectiveness of the proposed program.

References

[1] Allison, P.D. (2010) Survival analysis using SAS: A practical guide. SAS Publishing.

[2] Chen, T. and Zhong, S. (2011) Privacy-preserving models for comparing survival curves using the logrank test. Computer Methods and Programs in Biomedicine.

[3] Sato, H., Sato, S., Wang, Y.M. and Horikoshi, I. (1996) Add-in macros for rapid and versatile calculation of non-compartmental pharmacokinetic parameters on Microsoft Excel spreadsheets. Computer Methods and Programs in Biomedicine, 50(1), 43-52.

[4] Zhang, Y., Huo, M., Zhou, J. and Xie, S. (2010) PKSolver: An add-in program for pharmacokinetic and pharmacodynamic data analysis in Microsoft Excel. Computer Methods and Programs in Biomedicine, 99(3), 306-314.

[5] Brown, M. (1999) A methodology for simulating biological systems using Microsoft Excel. Computer Methods and Programs in Biomedicine, 58(2), 181-190.

[6] Wera, S. (1998) An EXCEL-based method to search for potential Ser/Thr-phosphorylation sites in proteins. Computer Methods and Programs in Biomedicine, 58(1), 65-68.

[7] Li, Y. and Zhong, S. (2012) Scorpio: A simple, convenient, Microsoft Excel Macro based program for privacy-preserving logrank test. Computer Applications for Database, Education, and Ubiquitous Computing, 86-91.
