Smart Phone Data Mining

User Authentication on Smart Phones Using a Data Mining Method
Yujin Tang lerry c@ruri.waseda.jp Nakazato Hidenori nakazato@waseda.jp Yoshiyori Urano urano@waseda.jp
Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan
Abstract
The 21st century has witnessed the wide spread of smart phones such as the iPhone. The growing importance of smart phones also implies an increasing amount of sensitive user data stored on them, which makes mobile user authentication even more important. Existing mobile user authentication methods either require special hardware or are not user transparent. In this paper, we present a mobile user authentication scheme using a data mining method that identifies a user based on the cell phone's application history and GPS information. These data can be collected on almost every smart phone without user awareness and tend to reflect a user's habits and biometric features. We organize these data in directional graphs and introduce a metric on which the classification is based. Experiments and results on real data are presented to show our scheme's effectiveness.
1. Introduction
Thanks to advanced mobile telecommunication technologies, smart phones provide much daily convenience. In Japan, for instance, people can read an online book, check their email or surf the Internet to pass the time on the train to work, or use the navigation function to find a restaurant they have never been to. The successive releases of smart phones such as Apple's iPhone and Google's Nexus One show that smart phones are playing an increasingly important role in our daily life. But that also implies an increasing amount of sensitive user data stored on a cell phone, e.g., credit card numbers and email account passwords. If a cell phone falls into the hands of a malicious user, the loss to the genuine user would go far beyond the cost of the device. This makes the authentication of mobile users even more important.

The dominant method for achieving user authentication on mobile devices is the use of 4-8 digit Personal Identification Numbers (PINs), which can be applied to both the device and the user's Subscriber Identity Module (SIM) [3], but this technology suffers from two main disadvantages. The obvious one is that users can get annoyed by repeatedly typing passwords, especially those who have long and complex ones for the sake of security. To overcome this, some applications record the password and keep the authentication valid for a certain period of time. But that is where the second disadvantage lies: the password mechanism only examines the user at the point of entry. Although some methods such as [7] improved the mechanism, the disadvantages mentioned above still remain.

Some smart phones provide more advanced mechanisms such as fingerprint scanning and iris scanning. For instance, in 2004 NTT DoCoMo's F505i handset was equipped with a built-in fingerprint sensor targeted at protecting its users from device theft. Although more accurate, this method requires special hardware that not every smart phone possesses. Furthermore, fingerprint scanning and iris scanning are considered intrusive [3]. [5] and [6] offered good methods to address the problem, but keystroke analysis may fail when facing a large number of samples, and facial/voice recognition requires a considerable amount of computation, which is not appropriate on a battery-limited mobile device. Although the analysis work can be offloaded to a remote server, the amount of data to be transferred is still considerable and consumes bandwidth.

In this paper, we present a mobile user authentication scheme based on a data mining method that utilizes the application history and the GPS information on a smart phone. This information can be collected easily on almost every smart phone without user awareness and tends to reflect a user's habits as well as some biometric features. We organize such data in directional graphs and then identify a user by using a rule-based classifier. The training phase takes positive training data only, and we introduce a new metric on which the classification decision is based. Our method is user transparent, and experiments on real data showed it to be effective. Like much other research on security, our authentication framework leaves some future work, but what we emphasize in this paper is the use of the data mining method.

The rest of the paper is organized as follows: Section 2 gives related work on this topic; Section 3 describes our data mining based authentication, including the system architecture, the organization of data, the generation of classification rules and the rule-based classification; experiments and results are presented in Section 4; Sections 5 and 6 conclude our current research and describe our future work.

978-0-9564263-8/3/$25.00 ©2010 IEEE
2. Related Work

This section reviews related work on user authentication on smart phones without the help of special hardware.

2.1. Personal Identification Number

A PIN is a secret numeric password shared between a user and a system that can be used to authenticate the user to the system. Typically, the user is required to provide a non-confidential user identifier or token and a confidential PIN to gain access to the system [8]. Users can get annoyed when repeatedly asked to enter passwords, and behaviors such as writing passwords down, selecting guessable passwords and sharing passwords with other people expose data to security risks [3][4].

2.2. Image-Based Password Authentication

Instead of a text-based password, the image-based password authentication method uses photographic images taken by users as passwords [7]. In this method, user-selected images are first registered at the server. Then, before entering a sensitive region, instead of typing a password the user is presented with several images among which the registered image may appear. The user selects the correct image to finish the authentication or, if the registered image is not shown, selects "not exist". Although this method relieves users from remembering long and complex passwords, it is not user transparent and still authenticates only at the point of entry.

2.3. Facial/Voice Recognition

These are advanced schemes for cell phone user authentication. They rely on biometric information to give high accuracy and can be implemented to be user transparent. With a user-facing camera, a smart phone can periodically determine the identity of the user. Similarly, voice recognition can be performed to identify a user when he/she makes a phone call. Although the two are equally accurate, facial recognition may be preferable to voice recognition if the user tends to make few phone calls. Although most smart phones are equipped with cameras, some have no user-facing camera, which limits the adoption of this method. Furthermore, both facial recognition and voice recognition need considerable computing power and are not considered appropriate on battery-limited mobile devices. Even with the help of remote analysis servers, the data to be transferred can take a considerable amount of bandwidth.

2.4. Keystroke Analysis

Keystroke analysis outperforms facial/voice recognition in that it requires relatively less computing power while still capturing user biometric features. [5] gives details of this method and also shows satisfactory results. However, since [5] considers only keystroke latency and hold-time characteristics, with a larger number of samples there would probably be multiple people with similar typing habits, which would confuse the system.

3. Data Mining Based Authentication

Clearly, user authentication on mobile devices is a tough task. Without additional hardware, it requires high accuracy but low power consumption, and the trend is to capture data that can reveal a user's biometric features. Our scheme collects a user's routine data, such as the locations visited every day and the application history, and then performs classification to find the personal habits on which authentication is based. The result of classification is either positive or negative, the former indicating a genuine user and the latter a false one. For the sake of security, the scheme falls back on the method of Section 2.1 or 2.2 in the case of a negative result. This section describes our scheme in detail.

3.1. System Architecture and Framework

Figure 1 shows our system architecture. Since mobile devices cannot devote much power to the authentication process, they are clients in our scheme and the analysis is performed on servers. This architecture does not affect the mobile devices' efficiency because the data transmitted to the server are small; in fact, it improves user transparency because the delay caused by authentication computing does not occur on the cell phone.

The Data Collection unit collects two kinds of data on a smart phone: GPS data and application history. While the phone is being used, data are transmitted periodically. When the device is idle or in an area with poor signal reception, data are accumulated and transmitted
as soon as the device is used again or has acceptable signal reception.

The Security Manager unit accepts the authentication result from the server. If the result is positive, the unit does nothing. In the presence of a negative result, instead of locking down the device immediately, the unit asks the user for a text-based or image-based password that was previously registered at the server. If the password is correct, the authentication result, although previously judged to be negative, is still considered positive; otherwise, the mobile device is locked.

On the server, the Data Preprocess unit accepts raw data from clients and organizes them into directional graphs, as will be described in Section 3.2, and then control is transferred to the Analysis Engine. The Analysis Engine checks whether the owner of the incoming data has trained rules in the Rule Storage. If not, the processed data are stored and no analysis is performed; otherwise, it compares the incoming data against the rules from the Rule Storage and returns the result to the client. If the result is positive, the new data are added to the User Data Storage. In the case of a negative result, it waits for a password from the client and compares it with the one registered in the Password Storage; the data are stored if the password is correct and discarded otherwise.

The Rule Generator unit is activated periodically on the server to update the classification rules. As the stored user data grow (an upper bound on data size can be set, beyond which the User Data Storage follows a FIFO policy), the time required for an update increases as well, but the server bears this cost and users are unaware of it.

We claim that the Security Manager should not lock down the device immediately on a negative classification result from the server, because we consider cases in which a user goes on a trip on some day and the data are bound to be different on that day. Data causing a final authentication failure are discarded, since our classification does not need negative training samples, as can be seen later.

Figure 1. System Architecture (client units: Data Collection, Security Manager; server units: Data Preprocess, Analysis Engine, Rule Generator, Password Storage, User Data Storage, Rule Storage)

3.2. Data Organization

Two kinds of data are collected and manipulated, namely GPS data and application history. Their organization is shown in Figure 2.

Figure 2. Data Structure (two example graphs, (i) and (ii); legend: GPS data node, GPS data list, application history node, application history list, relation between application and location)

Application history is a list of applications ordered by the time they are opened on a cell phone, e.g., Safari → SMS → Mail. Every time the foreground application is switched, the application coming to the foreground is appended after the last node in the list (or becomes the first node if the list is empty). Application data are recorded with the time stamps at which the applications are launched, so repeated launches of the same application are also detected.

GPS data show the current location of the cell phone, usually expressed as a longitude/latitude pair, e.g., (E 110°22′10″, N 84°52′33″). These data need to be processed first. As shown in Figure 3, locations within a square S form a group, and the area of S is pre-defined. The radius of the earth $R$ is approximately 6400 km and the side length of S is $l = \sqrt{S}$ (with $S$ denoting the area of the square), so we can calculate the arc angle $\theta = l/R$ within which to group GPS points. GPS data are accompanied by the time stamps at which they are fetched; a GPS location is fetched every $t$ seconds, with $t = t_i$ initially. GPS data are organized into a list ordered by sampling time. If the newly fetched GPS data turn out to be in the same group as the last appended GPS node, they are not appended and the GPS sampling interval $t$ is set to $\alpha t$, where $\alpha > 1$; otherwise, $t$ is set to $t_i$. Moreover, $t$ is also set back to $t_i$ if it becomes larger than a pre-defined value $t_{max}$. The reason for this policy is that a cell phone user is not likely to be moving all the time, which makes some GPS collections redundant. This policy does not affect the collection of application history.

Figure 3. GPS Data Process (a square S on the latitude/longitude grid that groups nearby GPS points)

Every time a GPS node is appended to the list, the last node in the application history list is linked to that GPS node and the application history list is cleared, so that the last node in the application history list is not linked to multiple GPS nodes. In this way, directional graphs such as (i) and (ii) in Figure 2 are formed. Links within the application history list and within the GPS data list are unidirectional, but links between two nodes from different lists are bidirectional. We claim that the directions between nodes are useful because the order reflects a user's habits.
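The GPS grouping and the adaptive sampling interval described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`cell_angle_deg`, `cell_of`, `next_interval`, `record`) are hypothetical, the grid is anchored at latitude/longitude zero, the same arc angle is applied to latitude and longitude, and the default `t_max` is an arbitrary placeholder.

```python
import math

EARTH_RADIUS_M = 6_400_000.0  # approximate Earth radius R, in metres


def cell_angle_deg(area_m2: float = 900.0) -> float:
    """Arc angle (in degrees) subtended by the side l = sqrt(S) of square S."""
    side_m = math.sqrt(area_m2)
    return math.degrees(side_m / EARTH_RADIUS_M)  # theta = l / R


def cell_of(lat_deg: float, lon_deg: float, area_m2: float = 900.0) -> tuple[int, int]:
    """Map a coordinate to the index of the grid square (group) containing it.

    Assumption: the same angle is used for latitude and longitude, so the
    squares shrink in the east-west direction away from the equator.
    """
    theta = cell_angle_deg(area_m2)
    return (math.floor(lat_deg / theta), math.floor(lon_deg / theta))


def next_interval(t: float, same_group: bool, t_init: float = 300.0,
                  alpha: float = 2.0, t_max: float = 2400.0) -> float:
    """Return the next GPS sampling interval.

    The interval is multiplied by alpha while the phone stays in the same
    group, and reset to t_init when the phone moves or t would exceed t_max.
    """
    if not same_group:
        return t_init
    t = alpha * t
    return t_init if t > t_max else t


def record(fixes, t_init: float = 300.0, alpha: float = 2.0, t_max: float = 2400.0):
    """Build the GPS node list from (lat, lon) fixes; return (nodes, last interval)."""
    nodes, t = [], t_init
    for lat, lon in fixes:
        cell = cell_of(lat, lon)
        same = bool(nodes) and nodes[-1] == cell
        if not same:
            nodes.append(cell)  # only a change of group creates a new node
        t = next_interval(t, same, t_init, alpha, t_max)
    return nodes, t
```

In this sketch, two consecutive fixes that fall into the same square produce a single node and double the interval, while a fix in a new square appends a node and resets the interval to its initial value.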
3.3. Rule Generation

Our authentication scheme is based on a rule-based classifier whose rules are generated from the User Data Storage as a training set. We define a classification rule to be a triple {frequent path, existence count, length of the path}. By path, we exclude general sub-patterns [1], which means that only those patterns that can be found exactly in the graph are counted. For instance, the pattern A → a is a path of graph (i) in Figure 2, while the pattern A → c is a general path but is not what we consider here. By frequent path, we mean a path whose support (existence percentage in the training set) exceeds a pre-defined threshold. E.g., if we define the support threshold to be $\sigma_{threshold} = 0.8$, the frequent paths in graphs (i) and (ii) of Figure 2 are f = {a, b, c, a → b, b → c, A, B, A → B, B → b, A → B → b, B → b → c, A → B → b → c, b → B}. The implication of this set is clear: the first five elements indicate the preferred applications of a user, as well as the order in which they are switched; the following three reflect the route a user often takes, e.g., the route to work. The remaining elements describe the relation between locations and applications, e.g., a professor likes reading e-books on the train to work, or a student often plays games in the classroom, checks SMS when bored and then goes to a restaurant when the class is over. We ignore paths containing back-and-forth routes, e.g., A → B → b → B → b → c, as the pattern B → b can repeat indefinitely. When frequent paths are found, the existence count and the length of each frequent path are also recorded. The discovery of frequent paths in graphs is not as complicated as it may seem; much research, such as [10], [11] and [9], has already proposed algorithms to mine frequent patterns.

3.4. Classification

Our authentication is a classification into only two classes: positive indicates a genuine user, and negative refers to a probable fake user, in which case the user is asked to enter a text- or image-based password to convince the system. Recall from Section 3.1 that when the server receives raw data from a client, the data are first preprocessed into a test sample, and the test sample is then compared against each rule of the same user. It seems that the more rules are hit by a test sample, the more likely that test sample is to be classified as positive. In the extreme cases where all rules are hit or no rule is hit, the test sample is easily classified as positive or negative, respectively. In practice, however, a test sample usually hits only part of the rules, and there is no definite ratio threshold for the decision. Furthermore, rules are not equally reliable: short paths like a and A can easily find a match, but this is not the case for longer paths like a → b → c. Intuitively, a path is more reliable if it is longer or if its support is higher. A larger collection of user data and a higher support threshold also help to better reveal user habits and biometric features. Thus, let the number of data samples of a user in the User Data Storage be $M$, the number of rules generated from them be $N$, and the number of frequent paths matched in the test sample be $n$. We define the positive class preference to be

\[
\rho^{+} = \frac{1}{n} \sum_{i=1}^{n} \sigma_i^{+} \, M \, \sigma_{threshold} \, \frac{l_i^{+}}{\bar{l}},
\]

in which $\sigma_i^{+}$ is the support of the $i$th frequent path matched in the test sample, $l_i^{+}$ is the length of this path and $\bar{l}$ is the average length of the frequent paths. The term $\frac{1}{n}$ yields the average credit evaluated by the rules that are hit. The meaning of this equation is straightforward: a larger value of $\rho^{+}$ suggests a higher probability of a positive classification. Conversely, we can express $\rho^{-}$ with a similar equation over the $N - n$ rules that are not hit. Since by definition $\sigma_i = C_i / M$, in which $C_i$ is the existence count of frequent path $i$, we have

\[
\begin{aligned}
\rho = \frac{\rho^{+}}{\rho^{-}}
&= \frac{\dfrac{1}{n} \sum_{i=1}^{n} \sigma_i^{+} \, M \, \sigma_{threshold} \, \dfrac{l_i^{+}}{\bar{l}}}
        {\dfrac{1}{N-n} \sum_{j=1}^{N-n} \sigma_j^{-} \, M \, \sigma_{threshold} \, \dfrac{l_j^{-}}{\bar{l}}} \\
&= \frac{\dfrac{1}{n} \sum_{i=1}^{n} \dfrac{C_i^{+}}{M} \, M \, \sigma_{threshold} \, \dfrac{l_i^{+}}{\bar{l}}}
        {\dfrac{1}{N-n} \sum_{j=1}^{N-n} \dfrac{C_j^{-}}{M} \, M \, \sigma_{threshold} \, \dfrac{l_j^{-}}{\bar{l}}} \\
&= \frac{(N-n) \sum_{i=1}^{n} C_i^{+} l_i^{+}}{n \sum_{j=1}^{N-n} C_j^{-} l_j^{-}}, \qquad N \neq n.
\end{aligned}
\]

Since $\rho^{+}$ and $\rho^{-}$ represent the average credits evaluated by the rules hit and the rules not hit, they are equally convincing; thus we propose that a value of $\rho$ larger than 1 indicates the positive class and a value smaller than 1 the negative class. A value equal to 1 can be classified into either class with equal probability.

The above evaluation works correctly only when the test sample's size is similar to that of the training samples. E.g., if each training sample contains nodes collected within one hour but a test sample contains nodes collected over three hours, the test sample probably hits more rules than a test sample one third of its size, so it is segmented into three test samples. This situation occurs when a cell phone cannot transfer its collected data promptly due to poor signal reception or being idle. In that case, $\rho_k$ is evaluated for each segment $k$ of the test sample and a voting mechanism is adopted, because other methods such as taking the average value may fail if one segment is missed or hit by all rules, which makes $\rho$ undefined.

Table 1. Experiment Environment Parameters

Parameter                                      Value
Initial GPS sampling interval $t_i$            300 s
Magnifying parameter $\alpha$                  2
Maximum sampling interval $t_{max}$            $\lceil t_{current} \rceil - t_{current}$
Area of S to group GPS data                    900 m^2
Support threshold $\sigma_{threshold}$         1/24
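The rule generation of Section 3.3 and the class preference of Section 3.4 can be sketched together as follows. This is an illustrative sketch under assumptions rather than the authors' implementation: each sample is assumed to be already reduced to the set of paths it contains (a frequent-pattern miner such as those in [9], [10], [11] would produce these), a path is represented as a tuple of node labels, its length is taken to be its number of nodes, and the names `mine_rules`, `preference` and `classify` are hypothetical.

```python
from collections import Counter


def mine_rules(training_samples, threshold):
    """Return {path: existence count C} for paths whose support C / M exceeds threshold.

    training_samples: list of sets of paths, one set per training sample.
    """
    m = len(training_samples)
    counts = Counter(path for sample in training_samples for path in set(sample))
    return {path: c for path, c in counts.items() if c / m > threshold}


def preference(rules, test_paths):
    """Compute rho = (N - n) * sum(C+ l+) / (n * sum(C- l-)).

    Returns None when no rule or every rule is hit, where rho is undefined.
    """
    hit = [(c, len(p)) for p, c in rules.items() if p in test_paths]
    missed = [(c, len(p)) for p, c in rules.items() if p not in test_paths]
    n, big_n = len(hit), len(rules)
    if n == 0 or n == big_n:
        return None
    numerator = (big_n - n) * sum(c * length for c, length in hit)
    denominator = n * sum(c * length for c, length in missed)
    return numerator / denominator


def classify(rules, test_paths):
    """Classify a test sample as 'positive', 'negative' or 'either' (rho == 1)."""
    n = sum(1 for p in rules if p in test_paths)
    if n == 0:
        return "negative"  # no rule hit
    if n == len(rules):
        return "positive"  # every rule hit
    rho = preference(rules, test_paths)
    if rho > 1:
        return "positive"
    if rho < 1:
        return "negative"
    return "either"  # the text assigns either class with equal probability
```

Because $M$, $\sigma_{threshold}$ and $\bar{l}$ cancel in the ratio, the sketch only needs the existence counts and path lengths of the rules. A segmented test sample would call `classify` once per segment and combine the results by voting; the voting rule itself is not specified in the text and is left out here.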
4. Experiments and Results

This section presents two experiments and their results to show how application history and GPS data on a smart phone can reflect a user's habits, as well as to show the effectiveness of our proposed authentication scheme. The experiment environment parameters are given in Table 1, where $t_{current}$ stands for the current time and $\lceil t_{current} \rceil$ means the next whole hour; e.g., if $t_{current}$ is 18:20, $\lceil t_{current} \rceil$ is 19:00 and $t_{max}$ at that time is 40 minutes. With this policy, we can group data into graphs whose first nodes are aligned to whole hours, and we set the support threshold to 1/24 so that habits within each hour can be captured. All experiments were performed on real data collected on ten volunteers' iPhones for twenty days. Of the ten volunteers, one is a software engineer who works at the same place five days per week; the other nine are students on two campuses who do not have to go to school every day. Two of the students are roommates and belong to the same campus.

4.1. Rules and Training Samples

Intuitively, if a person is observed for a long enough time, some daily habits should always be found. That is, for a given support threshold, the more training samples there are, the more stable the rules generated from them. By stable, we mean that the number of rules should approach a constant and the contents of the rules should not change much. If this intuition is true, our authentication scheme should start, and would give stable performance, once the rules become stable.

Table 2. Number of Rules

User    Day 1   Day 10   Day 15   Day 18   Day 20
1       355     39       24       21       22
2       391     49       31       29       28
3       222     43       40       40       40
4       451     30       25       25       24
5       311     34       28       25       23
6       214     60       60       60       60
7       381     41       31       24       22
8       379     43       36       30       31
9       364     234      241      184      175
10      189     17       19       15       15

Table 2 shows the relation between the number of rules generated and the length of the data collection period. For each user, we observe the number of rules generated on certain days, and it seems that most users' rules start to become stable between day 10 and day 15, with the exceptions of user 6 and user 9. User 6 went on a trip and left his cell phone shut down at home, while user 9 was looking for a job and thus went to different companies' seminars in that month. We also note that the size of the data collected per day is unstable for various reasons, e.g., the user was in a place with poor GPS reception, the cell phone was shut down, or the battery was dead. This can affect the speed of rule stabilization.
4.2. Classification Accuracy

Each user's first 15 days of data were taken as the training set and the remaining 5 days as the test set (for users 6 and 9, all of their data are in the test set). Then, for each user, 25 test samples were randomly picked from his/her own test set and 25 from other users' test sets, forming a test set of 50 test samples in total. Table 3 shows the results of the experiment. The average accuracy is 0.76, which shows our scheme's effectiveness.

Table 3. Experiment Accuracy

Trial   Accuracy     Trial   Accuracy
1       36/50        5       43/50
2       38/50        6       39/50
3       32/50        7       41/50
4       41/50        8       34/50
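As a quick arithmetic check, the average accuracy reported above can be reproduced from the per-trial counts in Table 3. The helper name `average_accuracy` is hypothetical and only serves this illustration.

```python
def average_accuracy(correct_counts, samples_per_trial):
    """Average accuracy over trials that all use the same number of test samples."""
    return sum(correct_counts) / (samples_per_trial * len(correct_counts))


# Correctly classified samples in trials 1 to 8 of Table 3 (50 samples each).
table3_correct = [36, 38, 32, 41, 43, 39, 41, 34]
print(f"{average_accuracy(table3_correct, 50):.2f}")  # prints 0.76
```

The eight trials sum to 304 correct classifications out of 400 test samples, which gives 0.76.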
5. Conclusion

In this paper, we proposed a user authentication scheme on smart phones using a data mining method. We collected application history and GPS data, which are available on almost every smart phone, and performed classification on them. GPS data and application history tend to reflect a user's habits; they are organized in directional graphs, and frequent paths are extracted to form classification rules. Our classifier takes only positive data as the training set, and we introduced a measuring method for class determination. We showed with experiments that rules tend to become stable with the accumulation of samples, and we demonstrated the effectiveness of our method by its classification accuracy.

6. Future Work

Although our scheme showed effectiveness, it is not very efficient in that it needs a certain period of time to collect samples before triggering the authentication phase. Also, in our experiments authentication is aligned to hours, which might be too long, and the incompleteness of the data collected due to poor signal reception can prevent rules from becoming stable. To overcome these problems, we are considering combining technologies such as keystroke analysis into our scheme.

7. References

[1] C. C. A. Mohammed J. Zaki. Xrules: An effective structural classifier for xml data. Machine Learning Journal, 62(1-2):137-170, February 2006.
[2] A. M. N.L. Clarke. The application of signature recognition to transparent handwriting verification for mobile devices. Information Management & Computer Security, 15(3):214-225, 2007.
[3] S. F. N.L. Clarke. Authentication of users on mobile telephones: a survey of attitudes and practices. Computers & Security, 24(7):519-527, October 2005.
[4] S. F. N.L. Clarke. Advanced user authentication for mobile devices. Computers & Security, 26(2):109-119, March 2007.
[5] S. F. N.L. Clarke. Authenticating mobile phone users using keystroke analysis. International Journal of Information Security, 6(1):1-14, January 2007.
[6] S. F. N.L. Clarke, S. Karatzouni. Transparent facial recognition for mobile devices. Proceedings of the 7th Security Conference, June 2008.
[7] H. K. Tetsuji Takada. Awase-e: Image-based authentication for mobile phones using user's favorite images. Mobile HCI, 2795:347-351, October 2003.
[8] Wikipedia. Personal identification number. http://en.wikipedia.org/wiki/Personal_identification_number, March 2010.
[9] J. H. Xifeng Yan. gspan: graph-based substructure pattern mining. Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference, pages 721-724, March 2002.
[10] M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 17(8):1021-1035, August 2005.
[11] M. J. Zaki. A unified approach to rooted tree mining: Algorithms and applications. Mining Graph Data, John Wiley and Sons, Inc, pages 381-410, April 2006.