
2011 International Conference on Pattern Analysis and Intelligent Robotics 28-29 June 2011, Putrajaya, Malaysia

Spelling Error Detector Rule for Jawi Stemmer


Suliana Sulaiman 1, Khairuddin Omar 2, Nazlia Omar 3, Mohd Zamri Murah 4, Hamdan Abdul Rahman 5
1 Faculty of Art, Computing and Creative Industry, Universiti Pendidikan Sultan Idris, Tanjong Malim, Perak, Malaysia
suliana@fskik.upsi.edu.my

Pattern Recognition Research Lab, CAIT, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
{ko,no,zamri,hamdan}@ftsm.ukm.my

Abstract: A stemmer is important, especially for information and document retrieval. It can also help to reduce the size of the dictionary. Malay stemmers normally need a root word dictionary to increase their accuracy. In our Jawi stemmer, we use a Jawi spelling error rule to detect whether the program produces the correct stemmed word after all possible affixes have been removed. The rule was tested on 3018 Jawi words with two-syllable root words, and the results were checked manually. The spelling error detector rule checked 97.8% of the two-syllable Jawi words correctly.

Keywords: Spelling error detector rule, Jawi.

978-1-61284-406-0/11/$26.00 2011 DAR-04

I. INTRODUCTION

A stemmer is a computer process that reduces affixed words to their root words [1]. Stemming can increase retrieval effectiveness, such as precision, especially for languages with complex morphology or for short queries [2]. A stemmer also reduces morphological variants to a single root word, which helps to improve recall [3]. Many languages, such as English, Arabic, Malay, French and Dutch, have their own stemmers. The earliest English stemmer was created in 1968 by Julie Beth Lovins to reduce English words to stemmed words [4]. Malay differs from English in that its morphology is more complex [5].

Malay can be written in either Rumi or Jawi, and the two scripts differ in many ways. Rumi is written from left to right and resembles the English alphabet, while Jawi is written from right to left and resembles Arabic script. The spelling of a Malay word in Jawi also differs from its spelling in Rumi. In our Jawi stemmer, we use a Jawi spelling error detector rule to make sure that the stemmed word is spelled correctly after each affix rule is applied. This paper proposes the spelling error detector rule to detect whether the stemmer produces the correct stemmed word after all possible affixes have been removed.

The paper is structured as follows. Section II gives an overview of related work on Malay stemmers. Section III explains the spelling technique used to spell Jawi words with two syllables. Section IV presents and discusses the experimental results, and Section V concludes.

II. RELATED WORK

Errors in Malay stemmers can be categorized as understemming, overstemming, unchanged and spelling exception [6]. The first Malay stemmer was developed by Asim Othman in 1993 [7]. He used a set of 121 rules to stem all possible affixes in Rumi and compared the stemmed word with a dictionary. He started with prefix-suffix rules, followed by prefix rules, suffix rules and infix rules. If the stemmed word did not match the dictionary, he applied the prefix rules and repeated the same process until all of the affix rules had been applied. In each case, the stemmed word had to be matched against the dictionary.

In 1995, Fatimah [8] produced her own Malay stemmer. She developed 561 rules for prefixes, suffixes, prefix-suffix pairs and infixes. All affixes were tested using these rules, and the stemmed word was compared with a root word dictionary. In 2001, N. Idris [9] tried to reduce the number of rules in Fatimah's stemmer by using only the significant prefix and suffix rules. To handle heterophyllous words, she used an extra dictionary, called the local dictionary, which holds root words for an explicit context and is highly focused on the application. In 2005, Taufik [10] tried to enhance Fatimah's algorithm to reduce stemming errors with a new method called Rule Frequency Order. All of the researchers mentioned above used a dictionary to make sure that the correct word is stemmed [7][8][9][10].

Even though Malay stemmers can produce good stemmed words, they are still not suitable for Jawi characters. One of the main reasons is that the spelling technique for Jawi differs from that of Rumi.

III. SPELLING ERROR RULE

Spelling in Jawi is more complicated than in Rumi because of the placement of the vowels. Jawi has three vowel letters: alif (ا), wau (و) and ya (ي). To spell a two-syllable word in Jawi correctly, we need to follow the five steps below [11]:

Step 1: Using vowels at both syllables
Step 2: Using hamzah (ء) between the first and second syllables
Step 3: Using a vowel only at the first syllable
Step 4: Using a vowel only at the second syllable
Step 5: No vowels at the first and second syllables

A. Spelling Jawi with Two Syllables

A two-syllable word is a combination of open and close syllables. An open syllable (example: ma) ends with a vowel, while a close syllable (example: kan) ends with a consonant. A two-syllable Jawi word can be spelt using the combinations Open Syllable + Open Syllable, Close Syllable + Open Syllable, Open Syllable + Close Syllable and Close Syllable + Close Syllable. Details of the two-syllable patterns are described in Table 1, Table 2, Table 3 and Table 4.
TABLE 1 COMBINATION OF OPEN SYLLABLES + OPEN SYLLABLES

Pattern                                   Example
Vowel + Vowel                             [i + a]
Vowel + Consonant Vowel                   [i + tu]
Consonant Vowel + Vowel                   [du + a]
Consonant Vowel + Consonant Vowel         [ko + ta]
Vowel + Consonant Diphthong               [a + bai]
Consonant Vowel + Consonant Diphthong     [pa + loi]

TABLE 2 COMBINATION OF CLOSE SYLLABLES + OPEN SYLLABLES

Pattern                                             Example
Vowel Consonant + Consonant Vowel                   [an + da]
Consonant Vowel Consonant + Consonant Vowel         [ban + tu]
Vowel Consonant + Consonant Diphthong               [an + dai]
Consonant Vowel Consonant + Consonant Diphthong     [san + tau]

TABLE 3 COMBINATION OF OPEN SYLLABLES + CLOSE SYLLABLES

Pattern                                         Example
Vowel + Vowel Consonant                         [a + ur]
Vowel + Consonant Vowel Consonant               [i + kan]
Consonant Vowel + Vowel Consonant               [ma + in]
Consonant Vowel + Consonant Vowel Consonant     [se + pit]

TABLE 4 COMBINATION OF CLOSE SYLLABLES + CLOSE SYLLABLES

Pattern                                                   Example
Vowel Consonant + Consonant Vowel Consonant               [in + tan]
Consonant Vowel Consonant + Consonant Vowel Consonant     [sun + tik]
Consonant Diphthong + Consonant Vowel Consonant           [tau + lan]

The vowels in Jawi are alif (ا), wau (و) and ya (ي), while the consonants are all other Jawi characters. A syllable that contains the character combination ai, au or oi is identified as a diphthong [11]. Most researchers, for example A. Othman [7], F. Ahmad [8], N. Idris [9] and Taufik [10], used a dictionary, a root word dictionary or a local dictionary as a component to make sure that the correct stemmed word is produced after the affix rules are applied. In this paper, we propose to check the stemmed word against the Jawi spelling rule to make sure that it is correct.

B. Spelling Error Detector Rule

To get the stemmed word from an affixed word, we need to apply the Jawi stemming rules. After the prefix rule is applied, it produces the stemmed word […]. This word is passed to the spelling error detector to detect whether it is spelled correctly. The rules are generated using Table 1, Table 2, Table 3 and Table 4 combined with Step 1 to Step 5 from Section III. Figure 1 shows an example of the spelling detector for the word […].

Fig. 1 Example of spelling error detector for word […] (flowchart: 1st syllable CV, open syllable; 2nd syllable CV, open syllable; left to right versus right to left; implement spelling error rule; compare; result correct or wrong)
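The patterns in Tables 1 to 4 can be illustrated with a minimal Python sketch that works on the Rumi syllables shown in the Example columns. It maps each letter to V (vowel), C (consonant) or S (diphthong), the symbols used by the detector algorithm, and labels a syllable as open or close by its last character. The function names are hypothetical, every letter is treated as one symbol (digraphs such as ng and ny are not handled), and syllables ending in a diphthong are not treated specially, so the sketch does not try to reproduce how the tables place them.

```python
VOWELS = set("aeiou")
DIPHTHONGS = ("ai", "au", "oi")  # combinations the paper identifies as diphthongs


def cv_pattern(syllable):
    """Return the syllable as a string of V, C and S (diphthong) symbols."""
    s = syllable.lower()
    symbols = []
    i = 0
    while i < len(s):
        if s[i:i + 2] in DIPHTHONGS:
            symbols.append("S")
            i += 2
        elif s[i] in VOWELS:
            symbols.append("V")
            i += 1
        else:
            symbols.append("C")
            i += 1
    return "".join(symbols)


def syllable_type(syllable):
    """Open if the last character is a vowel, Close if it is a consonant."""
    return "Open" if syllable[-1].lower() in VOWELS else "Close"


def combination(first, second):
    """Name the two-syllable combination, e.g. 'Close + Open'."""
    return f"{syllable_type(first)} + {syllable_type(second)}"


for first, second in [("ko", "ta"), ("ban", "tu"), ("ma", "in"), ("sun", "tik")]:
    print(first, second, cv_pattern(first), cv_pattern(second), combination(first, second))
```

Splitting a word into syllables is assumed to happen before these functions are called; the sketch only classifies syllables that are already separated.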

After eliminating possible affixes, the stemmed word must pass through the spelling error detector to make sure that it is the correct word. The algorithm for the Spelling Error Detector Rule is described below:

BEGIN
  RETRIEVE word
  W1 = word
  IF (W1 contains vowel, consonant and diphthong)
    REPLACE V = Vowel; C = Consonant; S = Diphthong
    IF V, S and C == close syllables THEN
      apply close syllables rule & compare with word
    ELSE
      apply open syllables rule & compare with word
    END IF
  END IF
END

Jawi vowels can be combined with any possible consonant to build a syllable [11]. Table 5 shows part of the Jawi spelling rules for open syllables and close syllables.
TABLE 5 PART OF JAWI SPELLING RULE FOR TWO SYLLABLES

Pattern: (e pepet) + (a)
  Open Syllables + Open Syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.
  Close Syllables + Open Syllables: IF no vowel at 1st syllable AND last character in 2nd syllable != […]/[…] THEN use method 4.
  Open Syllables + Close Syllables and Close Syllables + Close Syllables: IF no vowel present at both syllables THEN use method 5.

Pattern: (e pepet) + (i)
  All syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.

Pattern: (e pepet) + (u)
  All syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.

Pattern: (a) + (a)
  Open Syllables + Open Syllables: IF vowel at 1st syllable == […] AND last character in 2nd syllable != […] THEN use method 3. IF vowel at 1st syllable == […] AND last character in 2nd syllable == […] THEN use method 1.
  Close Syllables + Open Syllables: IF no vowel at 1st syllable AND last character in 2nd syllable != […]/[…] THEN use method 4. IF no vowel at 1st syllable AND last character in 2nd syllable == […]/[…] THEN use method 5.
  Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND no vowel at 2nd syllable THEN use method 3.
  Close Syllables + Close Syllables: IF no vowel present at both syllables THEN use method 5.

Pattern: (a) + (i)
  Open Syllables + Open Syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.
  Close Syllables + Open Syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.
  Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == consonant THEN use method 1. IF […] present between 1st AND 2nd syllable AND 1st character in 2nd syllable == vowel THEN use method 2.
  Close Syllables + Close Syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.

Pattern: (a) + (u)
  Open Syllables + Open Syllables and Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] AND 1st character in 2nd syllable == consonant THEN use method 1. IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] AND 1st character in 2nd syllable == vowel THEN use method 2.
  Close Syllables + Open Syllables and Close Syllables + Close Syllables: IF no vowel at 1st syllable AND vowel at 2nd syllable == […] THEN use method 4.

Pattern: (i) + (a)
  Open Syllables + Open Syllables: IF vowel at 1st syllable == […] AND 1st character in 2nd syllable != […]/[…] THEN use method 1. IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == […]/[…] THEN use method 3.
  Close Syllables + Open Syllables: IF vowel at 1st syllable == […] AND 1st character in 2nd syllable != […]/[…] THEN use method 1. IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == […]/[…] THEN use method 3.
  Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == consonant THEN use method 3. IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == vowel THEN use method 1.
  Close Syllables + Close Syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 3.

Pattern: (i) + (i)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (i) + (u)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (e) + (a)
  Open Syllables + Open Syllables and Close Syllables + Open Syllables: IF vowel at 1st syllable == […] AND last character in 2nd syllable != […]/[…] THEN use method 1. IF vowel at 1st syllable == […] AND last character in 2nd syllable == […]/[…] THEN use method 3.
  Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND first character in 2nd syllable == […] THEN use method 1. IF vowel at 1st syllable == […] AND first character in 2nd syllable == consonant THEN use method 3.
  Close Syllables + Close Syllables: IF vowel at 1st syllable == […] AND no vowel at 2nd syllable THEN use method 3.

Pattern: (e) + (e)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (e) + (o)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (u) + (a)
  Open Syllables + Open Syllables and Close Syllables + Open Syllables: IF vowel at 1st syllable == […] AND last character in 2nd syllable != […]/[…] THEN use method 1. IF vowel at 1st syllable == […] AND last character in 2nd syllable == […]/[…] THEN use method 3.
  Open Syllables + Close Syllables: IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == […] THEN use method 1. IF vowel at 1st syllable == […] AND 1st character in 2nd syllable == consonant THEN use method 3.
  Close Syllables + Close Syllables: IF vowel at 1st syllable == […] AND no vowel at 2nd syllable THEN use method 3.

Pattern: (u) + (u)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (o) + (a)
  Open Syllables + Open Syllables and Close Syllables + Open Syllables: IF vowel at 1st syllable == […] AND last character in 2nd syllable != […]/[…] THEN use method 1. IF vowel at 1st syllable == […] AND last character in 2nd syllable == […]/[…] THEN use method 3.
  Open Syllables + Close Syllables and Close Syllables + Close Syllables: IF vowel at 1st syllable == […] AND no vowel at 2nd syllable THEN use method 3.

Pattern: (o) + (e)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.

Pattern: (o) + (o)
  All syllables: IF vowel at 1st syllable == […] AND vowel at 2nd syllable == […] THEN use method 1.
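The method numbers in Table 5 appear to correspond to Step 1 to Step 5 of Section III. Under that assumption, the comparison step of the detector can be sketched in Python as follows: determine which step a given Jawi spelling follows from where its vowel letters and hamzah sit, then compare it with the method the rule expects. The function names and the small `EXPECTED_METHOD` excerpt are hypothetical; the excerpt covers only a few of the rules marked "apply to all syllables", and the pattern each one belongs to is our reading of the table.

```python
def observed_method(first_has_vowel, second_has_vowel, hamzah_between=False):
    """Map the shape of a two-syllable Jawi spelling to Step 1..5 of Section III."""
    if hamzah_between:
        return 2  # Step 2: hamzah between the first and second syllables
    if first_has_vowel and second_has_vowel:
        return 1  # Step 1: vowels at both syllables
    if first_has_vowel:
        return 3  # Step 3: vowel only at the first syllable
    if second_has_vowel:
        return 4  # Step 4: vowel only at the second syllable
    return 5      # Step 5: no vowels at either syllable


# Assumed excerpt of Table 5: only rules that apply to all syllable
# combinations, keyed by the (first vowel, second vowel) sound pattern.
EXPECTED_METHOD = {
    ("e pepet", "i"): 4,
    ("e pepet", "u"): 4,
    ("i", "i"): 1,
    ("i", "u"): 1,
}


def is_spelled_correctly(pattern, first_has_vowel, second_has_vowel,
                         hamzah_between=False):
    """Return True when the observed spelling follows the expected method."""
    observed = observed_method(first_has_vowel, second_has_vowel, hamzah_between)
    return observed == EXPECTED_METHOD[pattern]


# An (e pepet) + (i) word written with a vowel letter only in the 2nd syllable.
print(is_spelled_correctly(("e pepet", "i"), False, True))
# An (i) + (i) word written with a vowel letter only in the 1st syllable.
print(is_spelled_correctly(("i", "i"), True, False))
```

Rules that depend on a particular first or last character of the second syllable would need extra conditions and are left out of this sketch.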

IV. RESULT

To test the accuracy of the rules, we used 3018 two-syllable words written in Jawi and compared the results manually [11]. The results were then double-checked by Tuan Haji Abdul Rahman, who is a Jawi expert. The data were divided into 22 groups and cover all patterns in Table 6. Table 6 shows that 2951 words were checked correctly by the rule, while 67 failed to be detected.
TABLE 6 NUMBER OF WORDS CORRECTLY CHECKED AND ERRORS

Pattern   Description   Correct   Error   Total of Word
1         [e+a]         387       10      397
2         [e+i]         127       1       128
3         [e+u]         195       2       197
4         [a+a]         221       4       225
5         [a+i]         237       4       241
6         [a+u]         284       7       291
7         [i+a]         248       1       249
8         [i+i]         114       1       115
9         [i+u]         106       0       106
10        [e+a]         48        29      77
11        [e+e]         72        2       74
12        [e+o]         43        4       47
13        [u+a]         151       1       152
14        [u+i]         116       0       116
15        [u+u]         161       1       162
16        [o+a]         100       0       100
17        [o+e]         57        0       57
18        [o+o]         97        0       97
19        [x+ai]        128       0       128
20        [x+aw]        49        0       49
21        [x+oi]        10        0       10
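As a quick check of the figures in Table 6, the column totals and the overall accuracy can be recomputed. The short Python sketch below copies the Correct and Error columns for patterns 1 to 21; the variable names are ours.

```python
# Counts copied from Table 6, patterns 1 to 21.
correct = [387, 127, 195, 221, 237, 284, 248, 114, 106, 48, 72,
           43, 151, 116, 161, 100, 57, 97, 128, 49, 10]
errors = [10, 1, 2, 4, 4, 7, 1, 1, 0, 29, 2, 4, 1, 0, 1, 0, 0, 0, 0, 0, 0]

total_correct = sum(correct)
total_errors = sum(errors)
total_words = total_correct + total_errors
accuracy = 100 * total_correct / total_words

# Pattern with the most errors (patterns are numbered from 1).
worst_pattern = max(range(len(errors)), key=lambda i: errors[i]) + 1

print(total_correct, total_errors, total_words)  # 2951 67 3018
print(f"accuracy = {accuracy:.1f}%")             # accuracy = 97.8%
print("worst pattern:", worst_pattern)           # worst pattern: 10
```

The totals agree with the 2951 correct words, 67 errors and 3018 test words reported in the text.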



From this table, we can conclude that the Spelling Error Detector Rule checks 97.8% of two-syllable Jawi words correctly. Figure 2 shows the graph of words correctly checked for each pattern and the corresponding errors.

Fig. 2 Graph of words correctly checked and errors for each pattern.

The highest error can be seen in pattern 10, [e+a]. Most of these errors occur because the rule cannot identify whether the vowel belongs to e pepet or e taling. The rule reports […] as correct because it cannot differentiate between e pepet and e taling. In Rumi, there is no problem spelling bena with either e pepet or e taling, because both the vowel e and the vowel a are written in the first and second syllables. In Jawi, however, e taling can only be set apart from e pepet by using the vowel […].

V. CONCLUSIONS

The results show that 97.8% of two-syllable words were checked correctly using the Spelling Error Detector Rule. In Jawi, the character […] can be interpreted as e pepet or as e taling. The difference between e pepet and e taling can be found in words such as […] and […]. For this type of problem, the Spelling Error Detector Rule cannot detect a word that is wrongly spelled with e taling instead of e pepet. Our current effort involves further development of Jawi spelling rules for three-syllable words.

REFERENCES

[1] T. A. Eiman and L. Jessica, "Towards an Error-Free Arabic Stemming," Communication of ACM, vol. 23, pp. 9-14, 2008.
[2] R. Krovetz, "Viewing Morphology as an Inference Process," Univ. of Massachusetts, Amherst, MA, Tech. Rep. 1-12, 1993.
[3] A. K. Pandey and T. J. Siddiqui, "An Unsupervised Hindi Stemmer with Heuristic Improvements," in ACM, 2008, p. 99.
[4] J. B. Lovins, "Development of Stemming Algorithm," Mechanical Translation and Computational Linguistic, vol. 11, pp. 22-31, 1968.
[5] N. S. Karim, F. M. Onn, H. Musa and A. H. Mahmood, Tatabahasa Dewan Edisi Ketiga. Kuala Lumpur, Malaysia: Dewan Bahasa dan Pustaka, 2008.
[6] T. M. T. Sembok, "Word Stemming Algorithms and Retrieval Effectiveness in Malay and Arabic Documents Retrieval Systems," WASET, vol. 10, pp. 95-97, Nov. 2005.
[7] A. Othman, "Pengakar perkataan melayu untuk sistem capaian dokumen," MSc. thesis, National University of Malaysia, 1993.
[8] F. Ahmad, "Experiments with A Malay Stemming Algorithm," PhD thesis, National University of Malaysia, 1996.
[9] N. Idris and S. M. F. D. S. Mustafa, "Stemming for Term Conflation in Malay Texts," ACM, vol. 3, pp. 12-17.
[10] M. Taufik, F. Ahmad, R. Mahmod and T. M. T. Sembok, "Rules Frequency Order Stemmer for Malay Language," International Journal of Computer Science and Network Security (IJCSNS), vol. 9, pp. 433-438, 2009.
[11] H. A. Rahman, Panduan Menulis dan Mengeja Jawi. Kuala Lumpur, Malaysia: Dewan Bahasa dan Pustaka, 1999.


