
Parts of Speech Part 2

ICS 482 Natural Language Processing


Lecture 10: Husni Al-Muhtaseb

ICS 482 Natural Language Processing


Lecture 10: Parts of Speech Part 2 Husni Al-Muhtaseb
2

NLP Credits and Acknowledgment


These slides were adapted from presentations of the Authors of the book
SPEECH and LANGUAGE PROCESSING: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition

and some modifications from presentations found in the WEB by several scholars including the following

NLP Credits and Acknowledgment


If your name is missing, please contact me: muhtaseb at kfupm.edu.sa

NLP Credits and Acknowledgment


Husni Al-Muhtaseb James Martin Jim Martin Dan Jurafsky Sandiway Fong Song young in Paula Matuszek Mary-Angela Papalaskari Dick Crouch Tracy Kin L. Venkata Subramaniam Martin Volk Bruce R. Maxim Jan Hajič Srinath Srinivasa Simeon Ntafos Paolo Pirjanian Ricardo Vilalta Tom Lenaerts Heshaam Feili Björn Gambäck Christian Korthals Thomas G. Dietterich Devika Subramanian Duminda Wijesekera Lee McCluskey David J. Kriegman Kathleen McKeown Michael J. Ciaraldi David Finkel Min-Yen Kan Andreas Geyer-Schulz Franz J. Kurfess Tim Finin Nadjet Bouayad Kathy McCoy Hans Uszkoreit Azadeh Maghsoodi Khurshid Ahmad Martha Palmer Julia Hirschberg Staffan Larsson Elaine Rich Robert Wilensky Christof Monz Feiyu Xu Bonnie J. Dorr Nizar Habash Jakub Piskorski Massimo Poesio Rohini Srihari David Goss-Grubbs Mark Sanderson Thomas K Harris John Hutchins Andrew Elks Alexandros Potamianos Marc Davis Ray Larson Mike Rosner Latifa Al-Sulaiti Jimmy Lin Giorgio Satta Marti Hearst Jerry R. Hobbs Andrew McCallum Christopher Manning Hinrich Schütze Nick Kushmerick Mark Craven Alexander Gelbukh Chia-Hui Chang Gina-Anne Levow Guitao Gao Diana Maynard Qing Ma James Allan Zeynep Altan

Previous Lectures

Pre-start questionnaire
Introduction and phases of an NLP system
NLP applications - chatting with Alice
Finite State Automata, Regular Expressions & Languages
Deterministic & Non-deterministic FSAs
Morphology: Inflectional & Derivational
Parsing and Finite State Transducers
Stemming & Porter Stemmer
20 Minute Quiz
Statistical NLP: Language Modeling, N-Grams
Smoothing and N-Grams: Add-One & Witten-Bell
Return Quiz 1
Parts of Speech

Today's Lecture

Continue with Parts of Speech
Arabic Parts of Speech

Parts of Speech
Start with eight basic categories

Noun, Verb, Preposition, Pronoun, Adjective, Adverb, Article, Conjunction

These categories are based on morphological and distributional properties (not semantics). Some cases are easy, others are not.
8

Parts of Speech

Closed classes:

Prepositions: on, under, over, near, by, at, from, to, with, etc.
Determiners: a, an, the, etc.
Pronouns: she, who, I, others, etc.
Conjunctions: and, but, or, as, if, when, etc.
Auxiliary verbs: can, may, should, are, etc.
Particles: up, down, on, off, in, out, at, by, etc.

Open classes:

Nouns, Verbs, Adjectives, Adverbs
9

Sets of Parts of Speech: Tagsets


There are various standard tagsets to choose from; some have many more tags than others.
The choice of tagset depends on the application.
Accurate tagging can be done even with large tagsets.

10

Some of the known Tagsets (English)

Brown corpus: 87 tags
Penn Treebank: 45 tags
Lancaster UCREL C5: 61 tags
Lancaster C7: 145 tags

11

Some of Penn Treebank tags

12

Verb inflection tags

13

The entire Penn Treebank tagset

14

UCREL C5

15

Tagging

Part-of-speech tagging is the process of assigning a part of speech to each word in a sentence. Assume we have:

A tagset
A dictionary that gives the possible set of tags for each entry
A text to be tagged
A reason?

16

POS Tagging: Definition

The process of assigning a part-of-speech or lexical class marker to each word in a corpus:
WORDS: the driver put the keys on the table

TAGS: N, V, P, DET

17

Tag Ambiguity (updated)


                          87-tagset          45-tagset
Unambiguous (1 tag)         44,019             38,857
Ambiguous (2-7 tags)         5,490              8,844
  2 tags                     4,967              6,731
  3 tags                       411              1,621
  4 tags                        91                357
  5 tags                        17                 90
  6 tags              2 (well, beat)               32
  7 tags             2 (still, down)    6 (well, set, round, open, fit, down)
  8 tags                         -      4 ('s, half, back, a)
  9 tags                         -      3 (that, more, in)

Most words are unambiguous. Many of the most common English words are ambiguous.
18

Tagging: Three Methods

Rules
Probabilities (Stochastic)
Transformation-Based: sort of both

19

Rule-based Tagging

Use a dictionary (lexicon) to assign each word a list of potential POS.
Use large lists of hand-written disambiguation rules to identify a single POS for each word.

Example of a rule: NP → Det (Adj*) N

For example: the clever student

20
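A minimal sketch of this two-step rule-based approach (the tiny lexicon and the single disambiguation rule below are made up for illustration; real systems use large hand-written rule lists):

    # Rule-based tagging sketch: (1) lexicon lookup, (2) hand-written disambiguation rules.
    LEXICON = {
        "the": ["Det"],
        "clever": ["Adj"],
        "student": ["N", "V"],   # kept ambiguous for illustration
    }

    def potential_tags(word):
        """Step 1: assign every potential POS from the lexicon (default: N)."""
        return LEXICON.get(word.lower(), ["N"])

    def rule_based_tag(words):
        """Step 2: hand-written rules pick a single POS per word."""
        tags = []
        for i, word in enumerate(words):
            candidates = potential_tags(word)
            if len(candidates) == 1:
                tags.append(candidates[0])
                continue
            # Rule mirroring NP -> Det (Adj*) N: after a determiner plus any
            # number of adjectives, an N/V-ambiguous word is tagged as a noun.
            j = i - 1
            while j >= 0 and tags[j] == "Adj":
                j -= 1
            if j >= 0 and tags[j] == "Det" and "N" in candidates:
                tags.append("N")
            else:
                tags.append(candidates[0])   # fall back to the first listed tag
        return list(zip(words, tags))

    print(rule_based_tag(["the", "clever", "student"]))
    # [('the', 'Det'), ('clever', 'Adj'), ('student', 'N')]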

Probabilities: Tagging with lexical frequencies


Sami is expected to race tomorrow.
Sami/NNP is/VBZ expected/VBN to/TO race/VB tomorrow/NN

People continue to inquire the reason for the race for outer space.
People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/NN for/IN outer/JJ space/NN

Problem: assign a tag to "race" given its lexical frequency.
Solution: choose the tag with the greater probability: P(race|VB) vs. P(race|NN).
Actual estimates from the Switchboard corpus: P(race|NN) = .00041, P(race|VB) = .00003
21

Transformation-based: The Brill Tagger

An example of Transformation-Based Learning
Very popular (freely available, works fairly well)
A SUPERVISED method: requires a tagged corpus
Basic idea: do a quick job first (using frequency), then revise it using contextual rules

22

An example

Examples:

It is expected to race tomorrow.
The race for outer space.

Tagging algorithm:

1. Tag all uses of race as NN (the most likely tag in the Brown corpus):

   It is expected to race/NN tomorrow
   the race/NN for outer space

2. Use a transformation rule to replace the tag NN with VB for all uses of race preceded by the tag TO:

   It is expected to race/VB tomorrow
   the race/NN for outer space


23
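A minimal sketch of the two steps above: initialize every word with its most frequent tag, then apply one contextual transformation. The frequency table below is illustrative, not real Brown-corpus counts:

    # Transformation-based (Brill-style) tagging sketch:
    # 1) tag every word with its most frequent tag,
    # 2) apply transformation rules of the form "change tag A to B in context C".

    MOST_FREQUENT_TAG = {"race": "NN", "to": "TO", "the": "DT",
                         "it": "PRP", "is": "VBZ", "expected": "VBN",
                         "tomorrow": "NN", "for": "IN", "outer": "JJ", "space": "NN"}

    # One transformation rule, as on the slide: change NN to VB when the previous tag is TO.
    RULES = [("NN", "VB", lambda prev_tag: prev_tag == "TO")]

    def brill_tag(words):
        # Step 1: quick initial tagging by most frequent tag.
        tags = [MOST_FREQUENT_TAG.get(w.lower(), "NN") for w in words]
        # Step 2: revise with contextual transformation rules.
        for old, new, context_ok in RULES:
            for i in range(1, len(tags)):
                if tags[i] == old and context_ok(tags[i - 1]):
                    tags[i] = new
        return list(zip(words, tags))

    print(brill_tag("it is expected to race tomorrow".split()))
    # ... ('to', 'TO'), ('race', 'VB'), ...
    print(brill_tag("the race for outer space".split()))
    # ... ('race', 'NN'), ...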

Stochastic (Probabilities)

Simple approach: disambiguate words based on the probability that a word occurs with a particular tag.

N-gram approach: the best tag for a given word is determined by the probability that it occurs with the n previous tags.

Viterbi algorithm: trim the search for the most probable tag sequence, keeping the best N maximum likelihood estimates (N is the number of tags of the following word).

Hidden Markov Model: combines the above two approaches.


24

Viterbi Maximum Likelihood Estimates


Want the most likely path through this graph:

[tag lattice for "the can will rust": the → DT; can → noun, aux; will → noun, aux; rust → noun, verb]
25

Viterbi Maximum Likelihood Estimates


[tag lattice over states S1-S5 for "promised to back the bill": candidate tags include VBN, VBD, TO, VB, RB, JJ, NN, DT, NNP]
26

Viterbi Maximum Likelihood Estimates

We want the best set of tags for a sequence of words (a sentence):

W is a sequence of words: W = w1 w2 w3 ... wn
T is a sequence of tags:  T = t1 t2 t3 ... tn

argmax_T P(T | W) = argmax_T P(W | T) P(T) / P(W)     (Bayes' rule)

P(W) is common to all candidate tag sequences.
27

Viterbi Maximum Likelihood Estimates

We want the best set of tags for a sequence of words (a sentence):

W is a sequence of words: W = w1 w2 w3 ... wn
T is a sequence of tags:  T = t1 t2 t3 ... tn

argmax_T P(T | W) = argmax_T P(W | T) P(T)

since P(W) is common (the same for every tag sequence), it can be dropped.
28
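Putting the two slides together: Bayes' rule, dropping the constant P(W), and then the usual bigram (HMM) independence assumptions give the per-word formula used a few slides below. A standard derivation, shown here for reference:

    \hat{T} = \arg\max_{T} P(T \mid W)
            = \arg\max_{T} \frac{P(W \mid T)\, P(T)}{P(W)}
            = \arg\max_{T} P(W \mid T)\, P(T)
            \approx \arg\max_{t_1 \ldots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1})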

Stochastic POS Tagging: Example


1) Sami is expected to race tomorrow.
2) People continue to inquire the reason for the race for outer space.

29

Stochastic POS Tagging: Example


Example: suppose wi = race. Is it a verb (VB) or a noun (NN)?
Assume that some other mechanism has already tagged the surrounding words, leaving only race untagged.

1) Sami/NNP is/VBZ expected/VBN to/TO race/? tomorrow/NN
2) People/NNS continue/VBP to/TO inquire/VB the/DT reason/NN for/IN the/DT race/? for/IN outer/JJ space/NN

Bigram model:

ti = argmax_j P(tj | ti-1) P(wi | tj)

Compare: P(VB|TO) * P(race | VB)  vs.  P(NN|TO) * P(race | NN)
30

Simplify the problem: to/TO race/???  and  the/DT race/???

Where is the data?


Look at the Brown and Switchboard corpora
P(NN | TO) = 0.021 P(VB | TO) = 0.34

If we are expecting a verb, how likely it would be race


P( race | NN) = 0.00041 P( race | VB) = 0.00003

Finally:
P(NN | TO) P( race | NN) = 0.000007 P( VB | TO) P(race | VB) = 0.00001
31
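The same comparison as a tiny computation (probabilities copied from the slide; the dictionaries and helper name are mine, just for illustration):

    # Bigram disambiguation of "race" after "to/TO", using the estimates above.
    p_tag_given_prev = {("VB", "TO"): 0.34, ("NN", "TO"): 0.021}           # P(tag | previous tag)
    p_word_given_tag = {("race", "VB"): 0.00003, ("race", "NN"): 0.00041}  # P(word | tag)

    def score(tag, prev_tag, word):
        return p_tag_given_prev[(tag, prev_tag)] * p_word_given_tag[(word, tag)]

    vb = score("VB", "TO", "race")    # 0.34  * 0.00003 = 1.02e-05
    nn = score("NN", "TO", "race")    # 0.021 * 0.00041 = 8.61e-06
    print("VB" if vb > nn else "NN")  # VB: "race" after "to" is tagged as a verb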

Example: Bigram of Tags from a Corpus


Cat    # at i    Pair       # at i,i+1    Bigram            Estimate
0        300     0, ART         213       Prob(ART | 0)       0.71
0        300     0, N            87       Prob(N | 0)         0.29
ART      558     ART, N         558       Prob(N | ART)       1
N        833     N, V           358       Prob(V | N)         0.43
N        833     N, N           108       Prob(N | N)         0.13
N        833     N, P           366       Prob(P | N)         0.44
V        300     V, N            75       Prob(N | V)         0.35
V        300     V, ART         194       Prob(ART | V)       0.65
P        307     P, ART         226       Prob(ART | P)       0.74
P        307     P, N            81       Prob(N | P)         0.26
32
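The Estimate column is the maximum-likelihood ratio count(ti, ti+1) / count(ti); a small sketch (variable names are mine) that reproduces several entries of the table:

    # MLE bigram estimates from the counts above: P(t2 | t1) = count(t1, t2) / count(t1)
    unigram = {"0": 300, "ART": 558, "N": 833, "V": 300, "P": 307}
    bigram = {("0", "ART"): 213, ("0", "N"): 87, ("ART", "N"): 558,
              ("N", "V"): 358, ("N", "N"): 108, ("N", "P"): 366,
              ("V", "N"): 75, ("V", "ART"): 194,
              ("P", "ART"): 226, ("P", "N"): 81}

    def prob(t2, t1):
        return bigram.get((t1, t2), 0) / unigram[t1]

    print(round(prob("ART", "0"), 2))   # 0.71
    print(round(prob("N", "ART"), 2))   # 1.0
    print(round(prob("ART", "V"), 2))   # 0.65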

A Markov Chain

[state diagram over the tags ART, N, V, P with the transition probabilities from the previous table: 0.71, 0.29, 1, 0.43, 0.13, 0.44, 0.35, 0.65, 0.74, 0.26]

Assume 0.0001 for any unseen bigram.
33

Word Counts

            N      V     ART      P    Total
flies      21     23      0       0      44
fruit      49      5      1       0      55
like       10     30      0      21      61
a           1      0    201       0     202
the         1      0    300       2     303
flower     53     15      0       0      68
flowers    42     16      0       0      58
birds      64      1      0       0      65
others    592    210     56     284    1142
Total     833    300    558     307    1998
34

Computing Probabilities using previous Tables


P(the | ART) = 300/558 = 0.54
P(flies | N) = 0.025
P(flies | V) = 0.076
P(like | V) = 0.1
P(like | P) = 0.068
P(like | N) = 0.012
P(a | ART) = 0.360
P(a | N) = 0.001
P(flower | N) = 0.063
P(flower | V) = 0.05
P(birds | N) = 0.076

35
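A compact Viterbi sketch over the tables above: transition probabilities from the bigram table (slide 32), lexical probabilities from slide 35, and 0.0001 for unseen bigrams. Only the probabilities needed for "Flies like a flower" are filled in, and the code is an illustrative reconstruction, not the original course software; its intermediate values reproduce, up to rounding, the numbers on the iteration slides that follow.

    # Viterbi decoding of "flies like a flower" with the probabilities above.
    TAGS = ["N", "V", "ART", "P"]
    UNSEEN = 0.0001                                  # for any unseen tag bigram

    # P(tag | previous tag); "0" is the start-of-sentence state.
    trans = {("0", "ART"): 0.71, ("0", "N"): 0.29,
             ("ART", "N"): 1.0,
             ("N", "V"): 0.43, ("N", "N"): 0.13, ("N", "P"): 0.44,
             ("V", "N"): 0.35, ("V", "ART"): 0.65,
             ("P", "ART"): 0.74, ("P", "N"): 0.26}

    # P(word | tag), from slide 35.
    lex = {("flies", "N"): 0.025, ("flies", "V"): 0.076,
           ("like", "V"): 0.1, ("like", "P"): 0.068, ("like", "N"): 0.012,
           ("a", "ART"): 0.36, ("a", "N"): 0.001,
           ("flower", "N"): 0.063, ("flower", "V"): 0.05}

    def viterbi(words):
        # best[tag] = (probability of the best path ending in tag, that path)
        best = {t: (trans.get(("0", t), UNSEEN) * lex.get((words[0], t), 0.0), [t])
                for t in TAGS}
        for word in words[1:]:
            new = {}
            for t in TAGS:
                p, path = max(((best[prev][0] * trans.get((prev, t), UNSEEN), best[prev][1])
                               for prev in TAGS), key=lambda x: x[0])
                new[t] = (p * lex.get((word, t), 0.0), path + [t])
            best = new
        return max(best.values(), key=lambda x: x[0])

    prob, tags = viterbi(["flies", "like", "a", "flower"])
    print(tags)   # ['N', 'V', 'ART', 'N'] : flies/N like/V a/ART flower/N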

Viterbi Algorithm - Example


assume 0.0001 for any unseen bigram
Iteration 1

flies/V

7.6*10-6

Flies like a flower


0.00725

flies/N
NULL/0

flies/P

flies/ART

0
36

Viterbi Algorithm - Example


Iteration 2: Flies like a flower

[lattice: flies/V, flies/N → like/V, like/N, like/P, like/ART]
37

Viterbi Algorithm - Example


Iteration 2: Flies like a flower (values)

like/V    0.00031
like/N    1.3*10^-5
like/P    0.00022
like/ART  0
38

Viterbi Algorithm - Example


Iteration 3: Flies like a flower

[lattice: like/V, like/N, like/P → a/V, a/N, a/P, a/ART]
39

Viterbi Algorithm - Example


Iteration 3: Flies like a flower (values)

a/N    1.2*10^-7
a/ART  7.2*10^-5

40

Viterbi Algorithm - Example


Iteration 4: Flies like a flower

[lattice: a/N, a/ART → flower/V, flower/N, flower/P, flower/ART]
41

Viterbi Algorithm - Example


Iteration 4: Flies like a flower (values)

flower/V  2.6*10^-9
flower/N  4.3*10^-6

flower/N has the highest final probability, so the best path is flies/N like/V a/ART flower/N.
42

Performance

This method has achieved 95-96% accuracy with reasonably complex English tagsets and reasonable amounts of hand-tagged training data.

Forward pointer: it is also possible to train a system without hand-labeled training data.

43

How accurate are they?

POS taggers boast accuracy rates of 95-99%

Accuracy varies according to the text type/genre:
of the pre-tagged (training) corpus
of the text to be tagged

Worst-case scenario: assume a per-word success rate of 95%

Prob(one-word sentence correct) = .95
Prob(two-word sentence correct) = .95 * .95 = 90.25%
Prob(ten-word sentence correct) = .95^10 ≈ 60%

44
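The sentence-level figures above are just the per-word rate raised to the sentence length; for example:

    # Probability that an entire sentence is tagged correctly at 95% per word.
    p = 0.95
    for n in (1, 2, 10):
        print(n, round(p ** n, 4))   # 1 -> 0.95, 2 -> 0.9025, 10 -> 0.5987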

End of Part 1

45

Natural Language Processing

Lecture 10: Parts of Speech 2-2 Morphosyntactic Tagset Of Arabic - Husni Al-Muhtaseb
46

Shereen Khoja's Arabic tagset: 177 tags

103 Nouns
57 Verbs
9 Particles
7 Residual
1 Punctuation

47



Three genders:
Masculine, Feminine, Neuter
48


Three persons:
The speaker, the person being addressed, the person who is not present

Three numbers:
Singular, Dual, Plural
49


Three moods of the verb:
Indicative, Subjunctive, Jussive

Three case forms of the noun:
Nominative, Accusative, Genitive


50

51

Word: Noun, Verb, Particle, Residual, Punctuation

52

Word: Noun, Verb, Particle, Residual, Punctuation
Noun: Common, Proper, Pronoun, Numeral, Adjective

53

Word: Noun, Verb, Particle, Residual, Punctuation
Noun: Common, Proper, Pronoun, Numeral, Adjective
Pronoun: Personal, Relative, Demonstrative

54

Word: Noun, Verb, Particle, Residual, Punctuation
Noun: Common, Proper, Pronoun, Numeral, Adjective
Pronoun: Personal, Relative, Demonstrative
Relative pronoun: Specific, Common

55

Word: Noun, Verb, Particle, Residual, Punctuation
Noun: Common, Proper, Pronoun, Numeral, Adjective
Numeral: Cardinal, Ordinal, Numerical Adjective
56

Word: Noun, Verb, Particle, Residual, Punctuation
Verb: Perfect, Imperfect, Imperative

57

Word: Noun, Verb, Particle, Residual, Punctuation
Particle: Subordinates, Answers, Explanations, Prepositions, Adverbial

58

Word: Noun, Verb, Particle, Residual, Punctuation
Particle (continued): Conjunctions, Interjections, Exceptions, Negatives

59

Word: Noun, Verb, Particle, Residual, Punctuation
Residual: Foreign, Mathematical Formulae, Numerals

60

Word: Noun, Verb, Particle, Residual, Punctuation
Punctuation: Question Mark, Exclamation Mark, Comma

61

Arabic POS Tagger


[System overview diagram with components: Plain Arabic Text, ManTag, Training Corpus, DataExtract, Probability Matrix, Lexicons, Untagged Arabic Corpus, Tagged Corpus, APT]

62

DataExtract Process

Takes in a tagged corpus and extracts various lexicons and the probability matrix

Lexicon that includes all clitics.

(Sproat, 1992) defines a clitic as a syntactically separate word that functions phonologically as an affix

Lexicon that removes all clitics before adding the word

63

DataExtract Process

Produces a probability matrix for various levels of the tagset


Lexical probability: the probability of a word having a certain tag
Contextual probability: the probability of a tag following another tag

64

DataExtract Process
Contextual probability matrix (rows: current tag, columns: following tag):

        N      V      P      No.    Pu.
N     0.711  0.065  0.143  0.010  0.071
V     0.926  0.037  0.0    0.008  0.029
P     0.689  0.199  0.085  0.016  0.011
No.   0.509  0.06   0.098  0.009  0.324
Pu.   0.492  0.159  0.152  0.046  0.151

65
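A hedged sketch of what the DataExtract step computes (lexical and contextual probabilities from a tagged corpus); the toy sentences and variable names are mine, not Khoja's actual data format or code:

    # Toy sketch: extract lexical and contextual probabilities from a tagged corpus.
    from collections import Counter

    tagged_corpus = [
        [("ktb", "V"), ("AlTAlb", "N"), ("Aldrs", "N")],   # toy transliterated sentences
        [("fy", "P"), ("Albyt", "N")],
    ]

    word_tag = Counter()     # (word, tag) counts
    tag_bigram = Counter()   # (tag_i, tag_i+1) counts
    tag_count = Counter()    # tag counts

    for sentence in tagged_corpus:
        for i, (word, tag) in enumerate(sentence):
            word_tag[(word, tag)] += 1
            tag_count[tag] += 1
            if i + 1 < len(sentence):
                tag_bigram[(tag, sentence[i + 1][1])] += 1

    # Lexical probability: P(word | tag)
    lexical = {(w, t): c / tag_count[t] for (w, t), c in word_tag.items()}
    # Contextual probability: P(next tag | tag)
    contextual = {(t1, t2): c / tag_count[t1] for (t1, t2), c in tag_bigram.items()}

    print(contextual[("V", "N")])   # 1.0 in this toy corpus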

Arabic Corpora

59,040 words of the Saudi al-Jazirah newspaper, dated 03/03/1999
3,104 words of the Egyptian al-Ahram newspaper, dated 25/01/2000
5,811 words of the Qatari al-Bayan newspaper, dated 25/01/2000
17,204 words of al-Mishkat, an Egyptian published paper in social science, April 1999

66

APT: Arabic Part-of-speech Tagger

[Pipeline diagram with components: Arabic Words, LexiconLookup, Stemmer, StatisticalComponent; words with a unique tag vs. words with multiple tags]

67

68

The 5 main tags:

1. N [noun]
2. V [verb]
3. P [particle]
4. R [residual]
5. PU [punctuation]

69


Noun subcategories: 1.1. C [common], 1.2. P [proper], 1.3. Pr [pronoun], 1.4. Nu [numeral], 1.5. A [adjective]

70


Singular, masculine, accusative, common noun
Singular, masculine, genitive, common noun

Singular, feminine, nominative, common noun


71


1.3.1. P [personal]
Detached words, or attached to a word:
- to nouns, to indicate possession
- to verbs, as direct object
- to prepositions

1.3.2. R [relative]
1.3.3. D [demonstrative]


72


Third person, singular, masculine, personal pronoun
Singular, feminine, demonstrative pronoun

73

Relative Pronoun
1.3.2.1. S [specific]  1.3.2.2. C [common]

Dual, feminine, specific, relative pronoun
Plural, masculine, specific, relative pronoun
Common, relative pronoun
74


1.4.1. Ca [cardinal]  1.4.2. O [ordinal]  1.4.3. Na [numerical adjective]:

Singular, masculine, nominative, indefinite cardinal number


Singular, masculine, nominative, indefinite ordinal number


Singular, masculine, numerical adjective

75


Noun attributes used:

Gender: M [masculine], F [feminine], N [neuter]
Number: Sg [singular], Du [dual], Pl [plural]
Person: 1 [first], 2 [second], 3 [third]
Case: N [nominative], A [accusative], G [genitive]
Definiteness: D [definite], I [indefinite]
76

Verbs
1. P [perfect]  2. I [imperfect]  3. Iv [imperative]

First person, singular, neuter, perfect verb
First person, singular, neuter, indicative, imperfect verb
Second person, singular, masculine, imperative verb
77

Verbal Attributes Used


Gender: M [masculine], F [feminine], N [neuter]
Person: 1 [first], 2 [second], 3 [third]
Number: Sg [singular], Du [dual], Pl [plural]
Mood: I [indicative], S [subjunctive], J [jussive]
78


1.1. Pr [prepositions] 1.2. A [adverbial] 1.3. C [conjunctions] 1.4. I [interjections] 1.5. E [exceptions] 1.6 N [negatives] 1.7. A [answers] 1.8. X [explanations] 1.9. S [subordinates]
79


Prepositions: e.g. "in"
Adverbial particles: e.g. "shall"
Conjunctions: e.g. "and"
Interjections: e.g. "you"
Exceptions: e.g. "except"
Negatives: e.g. "not"
Answers: e.g. "yes"
Explanations: e.g. "that is"
Subordinates: e.g. "if"
80

81

82

83

84

85

86

87

88


89


90

91

92

Parts of Speech

Part of Speech: Nouns, Adverbs, Verbs, Particles, Unique, Residual, Punctuation
93

1. Noun

Nouns (N):
I. Type, II. Definiteness, III. Gender, IV. Number, V. Case, VI. Followship, VII. Variability, VIII. Soundness

94

I. Type

Type: Common (C), Proper (P), Adjective (J), Numeral (N), Personal Pronoun (S), Relative Pronoun (R), Demonstrative Pronoun (D)

95

II. Definiteness

Definiteness: Definite (D), Indefinite (I)

96

III. Gender

Gender: Masculine (M), Feminine (F), Unmarked (U)

97

IV. Number

Number:
Singular (1)
Dual (2)
Plural (3): Sound (S), Broken (B), Mass (M)
Unmarked (4)
Singular & Dual & Plural, e.g. "man" (A)
Dual & Plural, e.g. "na", "nahno" (T)
98

V. Case

Case: Nominative (N): Agent (A), Subject (S), Subject of cana (C), Duty Agent (D), Predicative of subject (P), Subject of cada (K), Predicative of inn (I)
99

Case: Accusative (A): Patient (P), Predicative of cada (K), State (manner) (S), Predicative of cana (C), Subject of inn (I), Distinguative (D), Cause (U), Infinitive (F)
100

Case: Genitive (G): Post preposition (P), Adjunct (post noun) (A)
Case: Vocative (V)
101

VI. Followship

Followship: Assertion (A), Coordinated (C), Attributive (T), Substitute (S)

102

VII. Variability

Variability:
Invariable (static) (I)
Variable (V): Vowels (W), Letters (L)
Semi-Variable (S)
103

VIII. Soundness

Soundness:
Sound (S)
Defective (D): Ending with ya (Y), Ending with alif + hamza (H), Ending with alif (A)


104

Type: Adjective

Adjective (J)
Degree: Positive (P), Comparative (C), Superlative (S)

105

Type: Numeral

Numeral (N)
Function: Cardinal (R), Ordinal (O), Numerical adjective (A)

106

Type: Personal Pronoun

Personal Pronoun (S)
Person: First (1), Second (2), Third (3)
Attachment: Attached (T), Detached (D)
107

Type: Relative Pronoun

Relative Pronoun (R)
Type: Specific (F), Common (M)
108

Example
<Noun, Common, Definite, Feminine, Singular, Nominative (Agent), Variable (Vowels), Sound>
<N-C-D-F-1-NA--VW-S>

<N-C-I-F-3B-AP--VW-S>  (Common, Indefinite, Feminine, Broken Plural, Accusative (Patient), Variable (Vowels), Sound)

109

<Noun, Personal Pronoun, Definite, Feminine, Singular, Genitive post-noun (Adjunct), Invariable (static), Third, Attached>
<N-S-D-F-1-GA-I-3-T>

110
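Purely as an illustration of how such composite tags can be assembled from the attribute codes above (the function and the plain hyphen separator are hypothetical, not Khoja's or the course's actual software):

    # Illustrative only: assemble a Khoja-style composite noun tag from attribute codes.
    def noun_tag(ntype, definiteness, gender, number, case, variability, soundness):
        """e.g. noun_tag('C', 'D', 'F', '1', 'NA', 'VW', 'S') -> 'N-C-D-F-1-NA-VW-S'."""
        return "-".join(["N", ntype, definiteness, gender, number, case, variability, soundness])

    print(noun_tag("C", "D", "F", "1", "NA", "VW", "S"))
    # N-C-D-F-1-NA-VW-S : common, definite, feminine, singular,
    # nominative (agent), variable (vowels), sound noun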

2. Adverbs

Adverbs (D)
Aspect: Time (T), Place (P)
Case: Nominative, Accusative, Genitive
111

Example: <Adverb, Place, Genitive> → <D-P-G>

112

3. Verbs

Verbs (V):
I. Tense (Aspect), II. Gender, III. Number, IV. Person, V. Case, VI. Conditional, VII. Voice, VIII. Variability, IX. Perfectness, X. Augmentation, XI. Amount, XII. Soundness, XIII. Transitivity
113

I. Tense
1. Past (P)
2. Present (Durative / Future) (R)
3. Imperative (I)

II. Gender
1. Masculine (M)
2. Feminine (F)
3. Unmarked (U)
114

III. Number

Singular (1), Dual (2), Plural (3), Unmarked (4)
Singular & Dual & Plural: the verb of "man" (A)
Dual & Plural: the verb of "ma, nahno" (T)

115

IV. Person
1. First (1)
2. Second (2)
3. Third (3)

V. Case
1. Indicative (N)
2. Subjunctive (A): Infinitive (F), Non-Infinitive (N)
3. Jussive (G)

116

VI. Conditional
1. The condition (C)
2. The answer (A)

VII. Voice
1. Active (A)
2. Passive (P)
117

VIII. Variability
1. Invariable (static) (I)
2. Variable (V): Vowels (W), Letters (L)

IX. Perfectness
1. Perfect (P)
2. Imperfect (can and cada) (I)
118

X. Augmentation
Augmented (A)
Non-augmented (N)

XI. Amount
Trilateral (T)
Quadri-literal (Q)
Penta-literal (P)


119

XII. Soundness

Defective (D): Initial (I), Hollow/Middle (H), Last (L), Initial + Last (T), Hollow + Last (O)
Sound (S)

120

XIII. Transitivity

Transitive (T): One Patient (O), Two Patients (T)
Intransitive (I): Agent only (A), Agent + State or Distinguisher (S), Nominal Sentence (N)

121

Example: <Verb, Past, Feminine, Singular, Third, Subjunctive non-infinitive, Active, Invariable (static), Perfect, Augmented, Trilateral, Sound, Intransitive Agent-only>
<V-P-F-1-3-AN-A-I-P-A-T-S-IA>

122

4. Particles

1. Coordinating (1)
2. Subordinating (2): Contrast (C), Exception (E), Initial (I)
3. Interrogative (3)
4. Preposition (4)
123

5. Possibility
6. Protection
7. Future
8. Conditional
9. Answer
10. Exclamation
11. Interjection/Interrogative

124

12. Negative
13. Imperative (Order)
14. Cause
15. Gerund
16. Deporticle
17. Ta of femininity
18. Explanation
125

19. Assertion
20. Wishing
21. Swearness

126

Example: <Particle, ta of femininity> → <P-17>
Example: <Particle, Swearness> → <P-21>

127

5. Unique (U)

Unique (U):
Denominal (D)
The letters at the beginning of some suras of the Qur'an (L)
Past (P)
Present (R)
Imperative/Order (I)
128

Example (Qur'anic opening letters):
<Unique, The letters at the beginning of some suras of the Qur'an>
<U-L>

129

6. Residual

Residual (R):
Type: Foreign (F), Formula (R), Acronym (A), Abbreviation (B)
Gender: Masculine (M), Feminine (F), Unmarked (U)
Number: Singular (1), Dual (2), Plural (3)
Case: Nominative (N), Accusative (A), Genitive (G), Vocative (V)
Followship: Assertion (A), Coordinated (C), Attributive (T), Substitute (S)
130

Example: <Residual, Foreign, Feminine, Singular, Genitive Adjunct (post-noun)>
<R-F-F-1-GA>

131

7. Punctuation (C)

?    Question Mark (Q)
!    Exclamation Mark (X)
...  Ellipsis (E)
.    Full Stop (F)
,    Comma (C)
;    Dotted Comma (semicolon) (D)
-    Hyphen (H)

132

-     Interspersion Marks (I)
,     The English Comma (G)
, ,   Interspersion Marks (R)
( )   Brackets (B)
"     Quotation Marks (U)
:     Colon (O)
[ ]   Square Brackets (S)
{ } / Slash (L)
133

Example: <Punctuation, Full Stop> → <C-F>

134

To Part 2: Arabic POS

135
