
Writer Identification Method Based on Forensic Knowledge

Marino Tapiador¹ and Juan A. Sigüenza²


¹ IBM Global Services and Universidad Autónoma de Madrid, Escuela Politécnica Superior, Spain
marino_tapiador@es.ibm.com
² Universidad Autónoma de Madrid, Escuela Politécnica Superior, Spain
jalberto.siguenza@ii.uam.es

Abstract. Police corps have extensively used forensic techniques to perform criminal identification. One of these techniques is questioned document examination. Forensic document examiners can identify individuals in large populations using a classification of the forms of manuscript characters, i.e. a handwriting formulation. This paper presents a method that defines a handwriting formulation allowing high identification accuracy while minimizing the amount of data used and the sample size. The method thus improves query performance on a database of writing specimens and reduces storage requirements. Experiments achieving 100% accuracy in the identification of 20 criminals in a real forensic database are presented.

1 Introduction

Questioned document examination is a technique commonly used by police corps around the world since the early 1600s. Forensic scientists have developed different methods to verify or identify criminals using multiple manuscript features. Methodological aspects of handwriting identification have long been studied intensively by forensic document examiners, as described in [1]. An individual's handwriting, and the identification of the authors of questioned handwritten documents, is of great importance to the criminal justice system from two points of view: verification and identification. The work described in this paper focused on off-line identification procedures (since it works with confiscated manuscripts) and was developed in collaboration with a forensic laboratory of a Spanish police corps. Handwriting identification can be regarded as a particular kind of dynamic biometrics [2], where the shapes and relationships of writing strokes are used as biometric features for authenticating an identity, i.e. we try to measure the individual's behavior. In forensic handwriting analysis these features are used to identify a criminal within a group of suspects. Several works related to off-line writer identification can be found in the biometrics literature, e.g. [3]-[5]; the most important text-dependent study of handwriting individuality that we are aware of, notable for the large volume of samples used (1,000 volunteers), was performed by Srihari et al. [3].
D. Zhang and A.K. Jain (Eds.): ICBA 2004, LNCS 3072, pp. 555-561, 2004. Springer-Verlag Berlin Heidelberg 2004


The identification algorithm used in that study was the traditional Nearest Neighbor (NN) rule [6], and the best accuracy was achieved with character-level features (GSC, Gradient-Structural-Concavity features) using 8 characters of a fixed word in a fixed text (text dependency). That work revealed a decreasing performance, from 98% accuracy with 2 writers down to 72% accuracy with 1,000 writers (1,000 different classes). Other important research was carried out by Said et al. [4], whose main objective was to develop a text-independent handwriting identification system. It used texture analysis for writer identification from non-uniformly skewed handwriting images; an identification accuracy of 96% was obtained with 150 test documents. Finally, another relevant work, focused on character examination, was done by Cohen et al. [5]. That work was mainly centered on B-spline curve representation and, as an example, covered a couple of use cases, one of them being writer-screening identification based on affine-invariant matching and classification of 2-D curves. The authors' experiment consisted of 10 writers, each of whom wrote the characters of the alphabet 10 times; a 100% classification rate was achieved using the full alphabet, i.e. 26 characters (English). Considering all these previous references, the goal of the work described in this paper was to design a method for building an automatic forensic system with accuracy as high as or higher than that of Cohen [5], to use it for criminal identification (i.e. without the friendly samples described by Cohen), and to do so with a reduced sample size that enables efficient search and storage of samples in a large-scale forensic database.

2 Our Method

2.1 Acquiring the Forensic Knowledge

The writer identification method used by the forensic group participating in this work is based on manually reviewing an archive of handwritten documents confiscated from several criminals. Forensic document examiners therefore cannot count on a perfect full-alphabet sample from suspects; they usually have to work with very limited handwriting samples, e.g. a couple of lines in a confiscated notebook, a photocopy of a contract, and so forth. The method used by the experts to compare two different questioned documents consists of generating a document classification based on the forms of the characters (in our terms, a formulation). This method increases the manual identification speed and is based on several kinds of relevant characters, so that each manuscript can be formulated in terms of the kinds of characters it contains. For example, if one of the characters present in the formulation is the letter K, forensic expertise says there are only three main shape variations for this class of letter; these subclasses have three associated formulation codes -K1, K2, and K3- and thus, if a document contains the letter K, its formulation will also include one of these three codes. Collecting all the formulation codes of the characters contained in a document, we obtain the formulation that labels that document; e.g. for a questioned document X, the formulation could be F(X) = {A2, B4, D2, G4, K3, M2, a3, f1, r3, z1}. With the questioned documents formulated in these terms, writer identification consists of formulating the questioned document and comparing its list of codes with that of each registered document in order to find the most similar document of a registered criminal.
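To make the idea concrete, the following is a minimal sketch (the paper publishes no code) of how a document's formulation could be represented as a set of codes and how two formulations might be compared; the archive entries and the set-overlap measure are illustrative assumptions, not the matching procedure the system actually uses (described in section 2.3).

def formulation_similarity(f_query, f_candidate):
    """Jaccard overlap between two formulations (sets of codes like 'K3' or 'a2')."""
    if not f_query or not f_candidate:
        return 0.0
    return len(f_query & f_candidate) / len(f_query | f_candidate)

# Formulation of a questioned document X, as in the example F(X) above.
questioned = {"A2", "B4", "D2", "G4", "K3", "M2", "a3", "f1", "r3", "z1"}

# A small, hypothetical archive of formulated documents from registered criminals.
archive = {
    "criminal_17": {"A2", "B4", "D2", "K3", "M2", "a3", "f1", "r3"},
    "criminal_05": {"A7", "B1", "D2", "G4", "K1", "M5", "a1", "f2"},
}

best = max(archive, key=lambda doc: formulation_similarity(questioned, archive[doc]))
print(best)  # -> "criminal_17", the registered formulation most similar to F(X)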


Because each handwritten document could be labeled with this special formulation (a list of letter codes), a database was built from the thousands of documents in the archive, registering just the formulation associated with each document. The idea was to be able to run queries over this database using the formulation of a new questioned document in order to find documents with the same or a similar formulation. However, this type of database turned out to be insufficient once the formulation process started, because of the problem of subjective interpretation. Different experts can assign the same character sample to different classes by visual analysis: for instance, if an expert E1 formulates a questioned document from author X and searches for it in a database where the genuine document from author X was formulated by another expert E2, the document may not be found, because formulations provided by different experts can differ. For example, figure 1 shows the case of an 'A' character sample from an anonymous writer that could be classified in different categories by different experts. After visual examination, expert 1 could decide to formulate the 'A' as type A4, i.e. 'A with centered tilde and curved head', while expert 2 could decide to classify it as type A7, i.e. 'A with centered tilde and squared head'.

Fig. 1. Different experts can formulate in a different way

Table 1. Letter Formulation

Uppercase (number of shape variations): A (8), B (6), D (6), E (6), G (3), J (6), K (3), M (5), Q (2), R (5)
Lowercase: a, b, d, f, g, m, p, q, r, t, z
Digits are also included in the formulation; each lowercase letter and digit likewise has a small number of catalogued shape variations.


Our method does not consider the subclasses included in the formulation for each kind of letter, in order to avoid these interpretation problems; instead it uses the forensic knowledge about which letters show the most shape variation and are therefore the most valuable for classification. This forensic knowledge was condensed into a new formulation by reviewing an archive of 5,000 criminal documents (manuscripts). The forensic examiners summarized the letters with the most shape variation in the formulation shown in table 1.

2.2 Data Collection System

The system works with digitized images of the original confiscated/authenticated documents, or with photocopies of them (A4 size), provided by the Spanish forensic lab collaborating in the project. The documents are scanned into digital form at a resolution of 300 DPI on a PC running Microsoft Windows. The full document image is stored on the system's hard disk in bitmap format (BMP) using 24 bpp. With the digitized full document of an individual ready, the next stage is character segmentation. A standard digital image-manipulation tool is used, extended with an automatic action (macro): the system operator draws a character selection with the computer mouse and labels it. All these letter samples are stored in a database of samples and users ('questioned' or 'genuine'). Because the forensic procedure we were attempting to automate is based purely on character examination, we included in the system the character-level features used in the study of Srihari [3], and we designed an intra-character comparative study that is discussed later in this paper. The next stage is the normalization process: the bitmap image is converted to gray scale, and an average spatial filter is applied to smooth the character image (noise elimination). Finally, the character is binarized to black-and-white format using Otsu's binarization algorithm [7], and the margins are dropped by computing the bounding box of the image. The feature extraction process is applied to the image resulting from normalization, and the digital features are registered in the system's database. The features are GSC and geometric (character-level) features. GSC is a group of features consisting of gradient, structural, and concavity features, while the geometric features are related to several geometric properties of the letter shape: the total number of black pixels, the height-width ratio, the centroid-height ratio, the centroid-width ratio, and a 9-spatial sensor (see [3]). All these features are compiled into a total binary feature vector. This vector is stored in the system's database and is used during the identification process to decide the similarity between character samples of individuals.

2.3 Identification System

The data collection module is used to capture character samples only for the letters contained in the formulation considered. Not all the letters of the formulation may be present in the database, because samples come from criminal documents confiscated in real life and only some letters appear in them. The identification method used by the system is the traditional Nearest Neighbor (NN) algorithm [6]. This algorithm is applied in a similar way as in the work of Said [4], using the Euclidean distance to identify a particular unknown individual.
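As a rough illustration of the normalization stage described in section 2.2 (gray-scale conversion, average smoothing, Otsu binarization, bounding-box cropping) and of a few of the geometric features listed there, here is a minimal sketch using OpenCV and NumPy. The 3x3 kernel size and the helper names are our own assumptions, and the GSC and 9-spatial-sensor features are omitted.

import cv2
import numpy as np

def normalize_character(bmp_path):
    """Sketch of the normalization stage: gray scale -> average filter ->
    Otsu binarization -> crop to the character's bounding box."""
    img = cv2.imread(bmp_path)                    # 24 bpp BMP, scanned at 300 DPI
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # conversion to gray scale
    smooth = cv2.blur(gray, (3, 3))               # average spatial filter (kernel size assumed)
    # Otsu's algorithm selects the binarization threshold automatically.
    _, binary = cv2.threshold(smooth, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
    ys, xs = np.where(binary == 0)                # ink (black) pixel coordinates
    return binary[ys.min():ys.max() + 1, xs.min():xs.max() + 1]

def geometric_features(binary):
    """A few of the geometric character-level features mentioned above."""
    h, w = binary.shape
    ys, xs = np.where(binary == 0)
    return {
        "black_pixels": int(ys.size),
        "height_width_ratio": h / w,
        "centroid_height_ratio": float(ys.mean()) / h,
        "centroid_width_ratio": float(xs.mean()) / w,
    }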


Considering a total of N genuine documents, and given all the handwriting images (character samples) in a document $D_i$, $i \le N$, for a genuine individual in the database, the total binary feature vector is computed as described in section 2.2 for each character image. Each handwritten genuine document $D_i$ is thus described by the set of character vectors $c_j$ it is made of:

$$D_i = \{ c_j,\ j \le \mathrm{card}(D_i) \},\quad i \le N. \qquad (1)$$

New character samples from a new questioned or unknown individual follow the same process, and their feature vectors are also computed. Let $L(c_j)$ be the letter corresponding to the character sample $c_j$. A distance measure DIS between a questioned handwritten document $Q$ and a genuine document $D_i$, over the types of letters $C \in \{\mathrm{Formulation}\}$ in the database, can then be defined according to relation (2). It means that, to identify the unknown user, all his/her registered character samples are considered, and the feature vector of each character is compared with those of all the characters from genuine users in the database. Only the same kinds of letters are compared ('A' vs. 'A', 'b' vs. 'b', and so forth within the formulation), and in general they come from different texts, due to the text-independence requirement of our experiments (i.e. they are characters not taken from the same word).

$$\mathrm{DIS}(Q, D_i) = \frac{1}{F} \sum_{C \in \{\mathrm{Formulation}\}} \frac{1}{M} \sum_{p \le M} \min_{q}\, \mathrm{dis}(x_p, y_q), \qquad (2)$$

where $L(x_p) = L(y_q) = C$, $x_p \in Q_C$, $y_q \in D_{i,C}$, $M = \mathrm{card}(Q_C)$, and $F = \mathrm{card}(\{\mathrm{Formulation}\})$.

Comparing two binary feature vectors consists of generating a distance vector between a questioned character and a genuine character. The distance vector has several components (sub-distances) for the different types of features, and each component is computed with the Euclidean distance $\mathrm{dis}(x_p, y_q)$ between the corresponding feature vectors of the unknown and genuine character images. Given the binary feature vectors of two characters A and B, the distance measure is:

$$\mathrm{dis}(V_a, V_b) = \sqrt{\sum_{i \le n} (V_{a_i} - V_{b_i})^2}, \quad n = \mathrm{card}(V_a) = \mathrm{card}(V_b). \qquad (3)$$

Therefore, the closer this measure is to zero, the more similar the two documents are. The writer of document Q is determined as the writer of the closest document in the database:

$$\mathrm{Writer}(Q) = \mathrm{Writer}\big( \arg\min_i \mathrm{DIS}(Q, D_i) \big), \quad i \le N = \mathrm{card}(\{\mathrm{database}\}). \qquad (4)$$
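To make relations (2)-(4) concrete, here is a minimal sketch, not code from the paper: a document is modelled as a mapping from each letter of the formulation to the list of binary feature vectors of its character samples (an assumed data layout), and the distance and decision rules follow the formulas above.

import numpy as np

def dis(va, vb):
    """Euclidean distance between two binary feature vectors, relation (3).
    Vectors are assumed to be 0/1 NumPy arrays of equal length."""
    return float(np.sqrt(np.sum((va.astype(int) - vb.astype(int)) ** 2)))

def DIS(questioned, genuine):
    """Document-to-document distance, relation (2). Both arguments map a
    letter of the formulation to a list of feature vectors; only letters
    present in both documents are compared (a practical simplification,
    since confiscated documents rarely contain the whole formulation)."""
    letters = [c for c in questioned if c in genuine]
    per_letter = []
    for c in letters:
        # For each questioned sample, keep the distance to its closest
        # genuine sample of the same letter, then average over the samples.
        per_sample = [min(dis(x, y) for y in genuine[c]) for x in questioned[c]]
        per_letter.append(sum(per_sample) / len(per_sample))
    return sum(per_letter) / len(per_letter)  # average over the letters compared

def identify_writer(questioned, database):
    """Nearest Neighbor decision, relation (4): return the writer of the
    genuine document with the smallest DIS to the questioned document."""
    return min(database, key=lambda writer: DIS(questioned, database[writer]))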

3 Experiments

Using the method previously described, a forensic database was created with documents confiscated from real criminals or authenticated in the presence of a police officer. This means the documents were acquired under real forensic conditions, which is an important point compared with the experiments of the other works cited in the introduction of this paper: in those works, the writing samples were obtained with the collaboration of volunteers under controlled conditions. Considering only the characters in the formulation proposed in table 1 allowed capturing and digitizing a reduced volume of 1,400 character samples.


A total of 20 genuine individuals and 20 questioned individuals were created in the database and associated with the samples. For a particular individual, the tool operator (a forensic expert) digitized an authenticated document and created two users in the database: a genuine user and a questioned/unknown user. 80% of the samples are used to register the genuine individuals, and the remaining 20% are used to test the identification accuracy of the system. The identification method described in section 2.3 was applied to all the questioned users registered in the database, and the experimental results are summarized in figure 2. The figure compares the identification accuracy obtained with the different types of character-level features used in the process, showing the ordered accuracy levels (%), with each kind of feature associated with one bar of the graph.
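As a minimal sketch of the evaluation protocol as we read it (each individual's samples split 80/20 into a genuine user and a questioned user, then rank-1 identification accuracy computed over the questioned users), the helper names and the random split below are our own assumptions:

import random

def split_samples(samples_by_writer, genuine_fraction=0.8):
    """Split each writer's character samples into genuine (enrolment) and
    questioned (test) parts. The random shuffle is an assumption; the paper
    only states the 80%/20% proportions."""
    genuine, questioned = {}, {}
    for writer, samples in samples_by_writer.items():
        shuffled = list(samples)
        random.shuffle(shuffled)
        cut = int(len(shuffled) * genuine_fraction)
        genuine[writer], questioned[writer] = shuffled[:cut], shuffled[cut:]
    return genuine, questioned

def identification_accuracy(questioned, genuine, identify):
    """Percentage of questioned users whose closest genuine user, according to
    the identify() function (e.g. the NN rule of section 2.3), is the true writer."""
    hits = sum(1 for w in questioned if identify(questioned[w], genuine) == w)
    return 100.0 * hits / len(questioned)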

4 Conclusions
This writer identification system shows 100% accuracy with 20 individuals using only gradient features (figure 2), better than the 96% accuracy for 10 individuals reported by Said et al. [4], the most important reference for a text-independent identification system (leaving aside the basic experiments of Cohen [5], which used full-alphabet 'ideal' samples and the computationally complex B-splines technique). Compared with the text-dependent experiments of Srihari [3], the accuracy is also better, because in that work accuracy had already dropped to 96% with only 10 individuals. The analysis of our results suggests that the basic idea of the formulation technique described in this paper can be used to reduce the sample size and also to keep a reduced number of vectors in the database. For the formulation presented in table 1, the identification method allows the use of gradient feature vectors of 192 bits instead of GSC feature vectors of 512 bits: a reduction of (512 - 192)/512, i.e. about 62% less information. These are significant advantages for writer identification systems in the area of forensic or criminal identification, where the questioned individuals belong to huge populations, which implies searching and storing data in large-scale databases. Future experiments will focus on increasing the database volume in order to draw more solid conclusions, since 20 individuals is a limited amount of data.

Acknowledgements
The authors would like to thank all the staff at the 'Laboratorio de Grafística' of the 'Servicio de Policía Judicial' at the 'Dirección General de la Guardia Civil' for their valuable help. This work was supported in part by the Spanish Government (MCYT) under Project FIT-070200-2002-131.


[Figure 2: bar chart of identification accuracy (%) by type of features, showing the ordered accuracy levels 100, 95, 90, 90, 70, 65, and 40. Feature types: 1 = gradient; 2 = structural; 3 = all; 4 = U/D/L/R/H; 5 = large stroke; 6 = geometric; 7 = coarse pixel.]

Fig. 2. Identification Accuracy (%) by Type of Features

References
[1] R. A. Huber and A. M. Headrick, Handwriting Identification: Facts and Fundamentals. CRC Press (1999).
[2] A. K. Jain, R. Bolle, and S. Pankanti, Biometrics: Personal Identification in Networked Society. Kluwer Academic Publishers (1999).
[3] S. N. Srihari et al., Handwriting identification: research to study validity of individuality of handwriting and develop computer-assisted procedures for comparing handwriting. Tech. Rep. CEDAR-TR-01-1, Center of Excellence for Document Analysis and Recognition, University at Buffalo, State University of New York, Feb. 2001, 54 pp.
[4] H. E. S. Said, G. S. Peake, T. N. Tan, and K. D. Baker, Writer identification from non-uniformly skewed handwriting images. In Proc. 9th British Machine Vision Conference (2000), pp. 478-487.
[5] F. S. Cohen, Z. Huang, and Z. Yang, Invariant matching and identification of curves using B-splines curve representation. IEEE Transactions on Image Processing, vol. 4, no. 1, pp. 1-10, Jan. 1995.
[6] R. O. Duda and P. E. Hart, Pattern Classification. 2nd ed., John Wiley & Sons (1973).
[7] N. Otsu, A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, vol. 9, pp. 62-66, Jan. 1979.
