Yujin Tang lerry c@ruri.waseda.jp Nakazato Hidenori nakazato@waseda.jp Yoshiyori Urano urano@waseda.jp
Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan
Abstract
The 21st century has witnessed the wide spread of smart phones such as the iPhone. The growing importance of smart phones also implies an increasing amount of sensitive user data stored on a cell phone, which makes mobile user authentication even more important. Existing mobile user authentication methods either require special hardware or are not user transparent. In this paper, we present a mobile user authentication scheme using a data mining method that identifies a user based on the cell phone's application history and GPS information. These data can be collected on almost every smart phone without user awareness and tend to reflect a user's habits and biometric features. We organize these data in directional graphs and introduce a metric on which the classification is based. Experiments and results on real data are presented to show our scheme's effectiveness.
1. Introduction
Thanks to advanced mobile telecommunication technologies, smart phones provide many daily conveniences. In Japan, for instance, people can read an online book, check their email or surf the Internet to pass the time on the train to work; they can find a restaurant they have never been to with the navigation function; and so on. The successive release of smart phones such as Apple's iPhone and Google's Nexus One shows that smart phones are playing an increasingly important role in our daily life. But that also implies an increasing amount of sensitive user data stored on a cell phone, e.g., credit card numbers, email account passwords, etc. If a cell phone falls into the hands of a malicious user, the loss to the genuine user would be far beyond the cost of the device. This makes the authentication of mobile users even more important.
The dominant method for achieving user authentication on some mobile devices is the use of 4-8 digit Personal Identification Numbers (PINs), which can be applied to both the device and the user's Subscriber Identity Module (SIM) [3], but this technology suffers from two main disadvantages. The obvious one is that users can get annoyed by repeatedly typing passwords, especially those who have long and complex ones for the sake of security. To overcome this, some applications record the password and keep the authentication valid for a certain period of time. But that is where the second disadvantage lies: the password mechanism only examines the user at the entrance. Although some methods such as [7] improved the mechanism, the disadvantages mentioned above still remain. Some smart phones provide more advanced mechanisms such as fingerprint scanning and iris scanning. For instance, in 2004 NTT DoCoMo's F505i handset was equipped with a built-in fingerprint sensor targeted at protecting its users from device theft. Although more accurate, this method requires special hardware which not every smart phone possesses. Furthermore, fingerprint scanning and iris scanning are considered intrusive [3]. [5] and [6] offered good methods to settle the problem, but keystroke analysis may fail when facing a large number of samples, and facial/voice recognition requires a considerable amount of computing, which is not appropriate on a battery-limited mobile device. Even if the analysis work is offloaded to a remote server, the amount of data to be transferred is still considerable and consumes bandwidth. In this paper, we present a mobile user authentication scheme based on a data mining method that utilizes the application history and the GPS information on a smart phone. This information can be collected easily on almost every smart phone without user awareness and tends to reflect a user's habits as well as some biometric features.
We organize such data in directional graphs and then identify a user by using a rule-based classifier. The training phase takes positive training data only, and we introduce a new metric on which the classification decision is based. Our method is user transparent and experiments on real data proved it to
978-0-9564263-8/3/$25.002010 IEEE
be effective. Like much other research on security, our authentication framework leaves some future work, but what we emphasize in this paper is the utilization of the data mining method. The rest of the paper is organized as follows: section 2 gives related work on this topic; section 3 describes our data-mining-based authentication method, including the system architecture, the organization of data, the generation of classification rules and the rule-based classification; experiments and results are presented in section 4; sections 5 and 6 conclude our current research and discuss our future work.
2. Related Work
This section gives some related work regarding user authentication on smart phones without the help of special hardware.
... can be performed to identify a user when he/she makes a phone call. Although the two are equally accurate, facial recognition may surpass voice recognition if the user tends to make fewer phone calls. Although most smart phones are equipped with cameras, some have no user-facing camera, which limits the adoption of this method. Furthermore, both facial recognition and voice recognition need considerable computing power and are not considered appropriate on battery-limited mobile devices. Even with the help of remote analysis servers, the data to be transferred can take a considerable amount of bandwidth.
Keystroke analysis outperforms facial/voice recognition in that it requires relatively little computing power while still retaining the user's biometric features. [5] gives details of this method and also shows satisfying results. However, since [5] considers only keystroke latency and hold-time characteristics, with a larger number of samples there would probably be multiple people with similar typing habits, which would confuse the system.
[Figure: system architecture diagram. Labels: Client; Server; Data Preprocess; Analysis Engine; Rule Generator; Password Storage; User Data Storage; Rule Storage.]
... as soon as the device is operated or has acceptable signal reception. The Security Manager unit accepts the authentication result from the server. If the result is positive, the unit does nothing. On a negative result, instead of locking down the device immediately, the unit asks the user for a text-based or image-based password that was previously registered on the server. If the password is correct, the authentication result, although previously judged negative, is considered positive; otherwise, the mobile device is locked.
The Data Preprocess unit on the server accepts raw data from clients and organizes them into directional graphs, as will be described in 3.2, and then control is transferred to the Analysis Engine. The Analysis Engine checks whether the owner of the incoming data has trained rules in the Rule Storage. If not, the processed data are stored and no analysis is performed; otherwise, it compares the incoming data against the rules from the Rule Storage and returns the result to the client. If the result is positive, the new data are added to the User Data Storage; in case of a negative result, it waits for a password from the client and compares it with the one registered in the Password Storage. The data are stored if the password is correct and discarded otherwise.
The Rule Generator unit is activated periodically on the server to update the classification rules. As the stored user data grow (an upper bound on data size can be set, beyond which the User Data Storage follows a FIFO policy), the time required for an update increases as well, but this cost is borne by the server and users are unaware of it.
We argue that the Security Manager should not lock down the device immediately on a negative classification result from the server, because a user may, for example, go on a trip some day, and the data are bound to be different on that day. Data causing a final authentication failure are discarded since our classification does not need
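The server-side decision flow described above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the `Server` class, its method names, the `classify` and `ask_password` callbacks and the default `max_samples` bound are all hypothetical, and the password is compared in plain text only to keep the sketch short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Server:
    """Hypothetical stand-in for the storages and the Analysis Engine."""
    rule_storage: Dict[str, list] = field(default_factory=dict)
    user_data: Dict[str, List[object]] = field(default_factory=dict)
    passwords: Dict[str, str] = field(default_factory=dict)
    max_samples: int = 1000  # assumed upper bound on stored samples per user

    def store(self, user: str, sample: object) -> None:
        """Append a sample; drop the oldest one when the bound is exceeded (FIFO)."""
        data = self.user_data.setdefault(user, [])
        data.append(sample)
        if len(data) > self.max_samples:
            del data[0]

    def analyse(self, user: str, sample: object,
                classify: Callable[[object, list], bool]) -> Optional[bool]:
        """Return None if the user has no rules yet, else the class decision."""
        rules = self.rule_storage.get(user)
        if not rules:
            self.store(user, sample)  # keep the data for later rule generation
            return None
        positive = classify(sample, rules)
        if positive:
            self.store(user, sample)
        return positive

    def check_password(self, user: str, sample: object, password: str) -> bool:
        """Fallback after a negative result: store the data only if the password matches."""
        ok = self.passwords.get(user) == password  # a real system would compare hashes
        if ok:
            self.store(user, sample)
        return ok


def security_manager(server: Server, user: str, sample: object,
                     classify: Callable[[object, list], bool],
                     ask_password: Callable[[], str]) -> str:
    """Client-side reaction to the server result: lock only after a failed password."""
    result = server.analyse(user, sample, classify)
    if result is None or result:
        return "unlocked"
    if server.check_password(user, sample, ask_password()):
        return "unlocked"
    return "locked"
```

Passing the classifier in as a callback keeps the flow independent of how rules are matched, so the same sketch works with any rule-based decision function.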
Figure 2. Data Structure [panels (i) and (ii)]
Application history is a list of applications ordered by the time they are opened on a cell phone, e.g., Safari → SMS → Mail. Every time the foreground application is switched, the application coming to the foreground is appended after the last node in the list (or becomes the first node if the list is empty). Application data are recorded with the time stamps at which the applications are launched; thus, repeated launches of the same application are also detected.
GPS data show the current location of the cell phone, usually expressed as a pair of latitude and longitude, e.g., (E 110°22′10″, N 84°52′33″). These data need to be processed first. As shown in Figure 3, locations within a square S form a group; the area of S is predefined. The radius of the earth R is approximately 6,400 km and the side length of S is $l = \sqrt{S}$, so we can calculate the arc angle $\theta = l/R$ within which to group GPS points. GPS data are accompanied by the time stamps at which they are fetched; a GPS location is fetched every $t$ seconds, with $t = t_i$ initially. GPS data are organized into a list ordered by sampling time. If the newly fetched GPS data turn out to be in the same group as the last appended GPS node, they are not appended, and the GPS sampling interval $t$ is set to $\alpha t$, with $\alpha > 1$; otherwise, $t$ is set to $t_i$. Moreover, $t$ is also set back to $t_i$ if it becomes larger than a predefined value, say $t_{max}$. The reason is that a cell phone user is not likely to be moving all the time, which makes some GPS collections redundant. This policy does not affect the collection of application history.
[Figure 3. GPS Data Process: square S with edge labels N 110°10′, N 110°30′, E 84°35′, E 84°55′]
Every time a GPS node is appended to the list, the last node in the application history list is linked to that GPS node and the application history list is cleared, so that the last node in the application history list is not linked to multiple GPS nodes. In this way, directional graphs such as (i) and (ii) in Figure 2 are formed. Links within the application history list and within the GPS data list are unidirectional, whereas a link between a node of one list and a node of the other is bidirectional. We consider the directions between nodes useful because the order reflects a user's habits.
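The GPS grouping, adaptive sampling and linking steps described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: `cell_of`, `GraphBuilder` and `alpha` (the magnifying parameter) are hypothetical names, the default `t_max` is an arbitrary placeholder, time stamps are omitted for brevity, and the same angular step is applied to latitude and longitude, which ignores the shrinking of longitude degrees away from the equator.

```python
import math

EARTH_RADIUS_M = 6_400_000.0  # approximate Earth radius R, in metres


def cell_of(lat_deg, lon_deg, area_m2):
    """Map a GPS fix to the index of the square group S that contains it."""
    side = math.sqrt(area_m2)                        # side length l = sqrt(S)
    step_deg = math.degrees(side / EARTH_RADIUS_M)   # arc angle l / R, in degrees
    return (math.floor(lat_deg / step_deg), math.floor(lon_deg / step_deg))


class GraphBuilder:
    """Builds the directional graph from application switches and GPS fixes."""

    def __init__(self, t_initial=300.0, alpha=2.0, t_max=3600.0, area_m2=900.0):
        self.t_initial = t_initial   # initial sampling interval t_i (seconds)
        self.alpha = alpha           # magnifying parameter, must be > 1
        self.t_max = t_max           # assumed cap; the interval resets beyond it
        self.area_m2 = area_m2       # area of the grouping square S
        self.interval = t_initial
        self.gps_nodes = []          # GPS groups in sampling order
        self.pending_apps = []       # applications opened since the last GPS node
        self.edges = []              # directed edges (source, target)

    def on_app_switch(self, app):
        """Append the new foreground application after the last one in the list."""
        if self.pending_apps:
            self.edges.append((self.pending_apps[-1], app))
        self.pending_apps.append(app)

    def on_fix(self, lat_deg, lon_deg):
        """Process one GPS fix and return the next sampling interval."""
        cell = cell_of(lat_deg, lon_deg, self.area_m2)
        if self.gps_nodes and self.gps_nodes[-1] == cell:
            # Same group as the last node: do not append, sample less often.
            self.interval *= self.alpha
            if self.interval > self.t_max:
                self.interval = self.t_initial
            return self.interval
        if self.gps_nodes:
            self.edges.append((self.gps_nodes[-1], cell))  # one-way GPS link
        self.gps_nodes.append(cell)
        if self.pending_apps:
            last_app = self.pending_apps[-1]
            self.edges.append((last_app, cell))  # two-way link between the last
            self.edges.append((cell, last_app))  # application and the GPS node
            self.pending_apps.clear()
        self.interval = self.t_initial
        return self.interval
```

With a 900 m² square, the grouping step is about 0.00027 degrees, so two fixes 0.01 degrees apart fall into different groups.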
... preferred applications of a user, as well as the order in which they are switched; the following three reflect the route a user often takes, e.g., the route to work. The remaining elements tell the relation between locations and applications, e.g., a professor likes reading e-books on the train to work; a student often plays games in the classroom, checks SMS when bored, and then goes to a restaurant when the class is over. We ignore paths containing back-and-forth routes, e.g., A → B → b → B → b → c, as the pattern B → b can recur infinitely. When frequent paths are found, the existence count and the length of each frequent path are also recorded. The discovery of frequent paths in graphs is not as complicated as one might imagine; studies such as [10], [11] and [9] have already proposed algorithms to mine frequent patterns.
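As a rough illustration of frequent path discovery, the sketch below enumerates short paths in each sample graph and keeps those whose support reaches a threshold. It is a deliberately naive stand-in, not one of the cited mining algorithms: the function names are hypothetical, path length is assumed to be the number of nodes, the existence count is assumed to be the number of samples containing the path, and back-and-forth routes are excluded by only allowing paths that never revisit a node.

```python
from collections import Counter
from typing import Dict, Hashable, List, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]   # node -> set of successor nodes
Path = Tuple[Hashable, ...]


def simple_paths(graph: Graph, max_len: int) -> Set[Path]:
    """All paths with 1..max_len nodes that never visit a node twice."""
    found: Set[Path] = set()

    def extend(path: Path) -> None:
        found.add(path)
        if len(path) == max_len:
            return
        for nxt in graph.get(path[-1], ()):
            if nxt not in path:          # skip back-and-forth routes
                extend(path + (nxt,))

    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    for node in nodes:
        extend((node,))
    return found


def mine_frequent_paths(samples: List[Graph], threshold: float,
                        max_len: int = 4) -> Dict[Path, int]:
    """Return {path: existence count} for paths with support >= threshold."""
    counts: Counter = Counter()
    for graph in samples:
        counts.update(simple_paths(graph, max_len))  # each sample counts once
    total = len(samples)
    return {p: c for p, c in counts.items() if c / total >= threshold}
```

Exhaustive enumeration grows quickly with `max_len`, which is why dedicated pattern-mining algorithms are preferable on realistic data sizes.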
3.4. Classification
Our authentication is a classification with only two classes: positive indicates a genuine user and negative refers to a probable fake user, in which case the user is asked to enter a text/image-based password to convince the system. Recall from 3.1 that when the server receives raw data from a client, the data are first preprocessed into a test sample, and the test sample is then compared against each rule of the same user. It seems that the more rules a test sample hits, the more likely it is to be classified as positive. In the extreme cases, if all rules are hit (or none is hit), the test sample is easily classified as positive (or negative). In practice, however, a test sample hits only a part of the rules most of the time, and there is no definite ratio threshold for the decision. Furthermore, rules are not equally reliable: short paths like a and A can easily find a match, but this is not the case for longer paths like a → b → c. Intuitively, a path is more reliable if it is longer or if its support is higher. A larger collection of user data and a higher support threshold also help to reveal the user's habits and biometric features better. Thus, let the number of data samples of a user in the User Data Storage be $M$, the number of rules generated from them be $N$ and the number of frequent paths matched in the test sample be $n$. We define a positive class preference as
$$\rho^{+} = \frac{1}{n} \sum_{i=1}^{n} \sigma_i^{+} \cdot \frac{l_i^{+}}{\bar{l}} \cdot M \cdot \sigma_{threshold},$$
in which $\sigma_i^{+}$ is the support of the $i$th frequent path matched in the test sample, $l_i^{+}$ is the length of this path and $\bar{l}$ is the average length of frequent paths. The factor $\frac{1}{n}$ yields the average credit given by the rules that are hit. The meaning of this equation is straightforward: a larger value of $\rho^{+}$ suggests a higher probability of a positive classification. Conversely, we can express $\rho^{-}$ with a similar equation over the rules that are not hit, and by definition $\sigma_i = C_i / M$, in which $C_i$ is the existence count
Table 1. Experiment Environment Parameters
Parameter                                   Value
Initial GPS sampling interval ti            300 s
Magnifying parameter                        2
Maximum sampling interval tmax              tcurrent tcurrent
Area of S to group GPS data                 900 m^2
Support threshold                           1/24
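$$\rho = \frac{\rho^{+}}{\rho^{-}} = \frac{\frac{1}{n}\sum_{i=1}^{n} C_i^{+}\, l_i^{+}}{\frac{1}{N-n}\sum_{j=1}^{N-n} C_j^{-}\, l_j^{-}}$$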
Since $\rho^{+}$ and $\rho^{-}$ represent the average credits given by the rules hit and the rules not hit, they are equally convincing; thus we propose that a value of $\rho = \rho^{+}/\rho^{-}$ larger than 1 indicates the positive class and a value smaller than 1 the negative class. A value equal to 1 can be classified into either class with equal probability. The above evaluation works correctly only when the size of the test sample is similar to that of the training samples. For example, if each training sample contains nodes collected within one hour but a test sample contains nodes collected over three hours, the test sample probably hits more rules than one that is one third of its size, so it is segmented into three test samples. This situation occurs when a cell phone cannot transfer its collected data promptly due to poor signal reception or being in an idle state. In that case, $\rho_k$ is evaluated for each segment $k$ of the test sample and a voting mechanism is adopted, because other methods such as averaging may fail if one segment is missed or hit by all rules, which makes $\rho_k$ undefined.
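The decision rule can be sketched in Python as below. The sketch assumes that each rule's credit is its existence count times its path length and that the class preference averages this credit over the rules hit (or not hit), so that factors shared by both averages cancel in the ratio. The function names are hypothetical, a rule is represented as a tuple of nodes mapped to its existence count, and ties in the vote are resolved as negative, which the text does not specify.

```python
import random
from typing import Dict, List, Optional, Set, Tuple

Path = Tuple[str, ...]


def preference_ratio(rules: Dict[Path, int], hit: Set[Path]) -> Optional[float]:
    """Average credit of rules hit divided by that of rules not hit.

    Returns None when the ratio is undefined (all rules hit, or none hit).
    """
    hit_credit = [count * len(path) for path, count in rules.items() if path in hit]
    miss_credit = [count * len(path) for path, count in rules.items() if path not in hit]
    if not hit_credit or not miss_credit:
        return None
    return (sum(hit_credit) / len(hit_credit)) / (sum(miss_credit) / len(miss_credit))


def classify(rules: Dict[Path, int], hit: Set[Path]) -> bool:
    """True for the positive class (genuine user), False for the negative class."""
    ratio = preference_ratio(rules, hit)
    if ratio is None:
        # Extreme cases: every rule hit is positive, no rule hit is negative.
        return any(path in hit for path in rules)
    if ratio == 1:
        return random.random() < 0.5   # either class with equal probability
    return ratio > 1


def classify_segments(rules: Dict[Path, int], hits_per_segment: List[Set[Path]]) -> bool:
    """Majority vote over the segments of an oversized test sample."""
    votes = [classify(rules, hit) for hit in hits_per_segment]
    return sum(votes) * 2 > len(votes)   # assumption: a tie counts as negative
```

For example, with rules `{("a",): 10, ("a", "b", "c"): 4, ("A",): 8, ("A", "B"): 3}`, hitting only the long path gives a credit of 12 against an average of 8 for the rules missed, so the ratio is 1.5 and the sample is classified as positive.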
[Table header: User 1 2 3 4 5 6 7 8 9 10]
Table 2 shows the relation between the number of rules generated and the length of the data collection period. For each user, we observe the number of rules generated on a given day; most users' rules become stable between day 10 and day 15, with the exceptions of user 6 and user 9. User 6 went on a trip and left his cell phone shut down at home, while user 9 was looking for a job and thus went to different companies' seminars in that month. We also note that the size of data collected per day is unstable for various reasons, e.g., the user was in a place with poor GPS reception, the cell phone was shut down, or the battery was dead. This can affect how quickly the rules stabilize.
5. Conclusion
In this paper, we proposed a user authentication scheme on smart phones using a data mining method. We collected application history and GPS data, which are available on almost every smart phone, and performed classification on them. GPS data and application history tend to reflect a user's habits; they are organized in directional graphs, and frequent paths are extracted to form classification rules. Our classifier takes only positive data as the training set, and we introduced a measuring method for class determination. We showed with experiments that rules tend to become stable as samples accumulate, and we demonstrated the effectiveness of our method through its classification accuracy.
6. Future Work
Although our scheme proved effective, it is not very efficient in that it needs a certain period of time to collect samples before the authentication phase can be triggered. Also, in our experiments authentication is aligned to hours, which might be too long, and the incompleteness of the data collected due to poor signal reception can prevent the rules from becoming stable. To overcome these problems, we are considering combining technologies such as keystroke analysis into our scheme.
7. References
[1] C. C. A. Mohammed J. Zaki. XRules: An effective structural classifier for XML data. Machine Learning Journal, 62(1-2):137-170, February 2006.
[2] A. M. N.L. Clarke. The application of signature recognition to transparent handwriting verification for mobile devices. Information Management & Computer Security, 15(3):214-225, 2007.
[3] S. F. N.L. Clarke. Authentication of users on mobile telephones - a survey of attitudes and practices. Computers & Security, 24(7):519-527, October 2005.
[4] S. F. N.L. Clarke. Advanced user authentication for mobile devices. Computers & Security, 26(2):109-119, March 2007.
[5] S. F. N.L. Clarke. Authenticating mobile phone users using keystroke analysis. International Journal of Information Security, 6(1):1-14, January 2007.
[6] S. F. N.L. Clarke, S. Karatzouni. Transparent facial recognition for mobile devices. Proceedings of the 7th Security Conference, June 2008.
[7] H. K. Tetsuji Takada. Awase-e: Image-based authentication for mobile phones using user's favorite images. Mobile HCI, 2795:347-351, October 2003.
[8] Wikipedia. Personal identification number. http://en.wikipedia.org/wiki/Personal_identification_number, March 2010.
[9] J. H. Xifeng Yan. gSpan: graph-based substructure pattern mining. Data Mining, 2002. ICDM 2002. Proceedings. 2002 IEEE International Conference, pages 721-724, March 2002.
[10] M. J. Zaki. Efficiently mining frequent trees in a forest: Algorithms and applications. IEEE Transactions on Knowledge and Data Engineering, 17(8):1021-1035, August 2005.
[11] M. J. Zaki. A unified approach to rooted tree mining: Algorithms and applications. Mining Graph Data, John Wiley and Sons, Inc, pages 381-410, April 2006.