Redefining Urdu Morphology

REDEFINING URDU MORPHOLOGY AND GRAMMAR FOR
THE DEVELOPMENT OF AN INTEGRATED SENTIMENT
ANALYSIS FRAMEWORK
AFRAZ ZAHRA SYED

2007-PHD-CS-07
SUPERVISED BY
DR. MUHAMMAD ASLAM
(2013)
Department of Computer Science and Engineering

University of Engineering and Technology
Lahore, Pakistan
Redefining Urdu Morphology and Grammar for the Development
of an Integrated Sentiment Analysis Framework
Dissertation
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
(2013)
AFRAZ ZAHRA SYED

2007-PHD-CS-07
SUPERVISED BY
DR. MUHAMMAD ASLAM

Lahore, Pakistan
Redefining Urdu Morphology and Grammar for the Development
of an Integrated Sentiment Analysis Framework
A dissertation submitted in partial fulfillment of the requirements for the
degree of Doctor of Philosophy in Computer Science
By
Afraz Zahra Syed (2007-PhD-CS-07)
Approved on: ______________________________
Internal Examiner: __________________________
Dr. Muhammad Aslam

Assistant Professor,
Department of Computer Science and Engineering,
University of Engineering and Technology, Lahore, Pakistan.
External Examiner: __________________________
Dr. Farooq Ahmad

Associate Professor,
Faculty of Information Technology,
University of Central Punjab, Lahore, Pakistan.
_______________________________
Chairman,
Department of Computer Science and Engineering,
University of Engineering and Technology,
Lahore, Pakistan.
_______________________________
Dean,
Faculty of Electrical Engineering,
University of Engineering and Technology,
Lahore, Pakistan.
This thesis has been evaluated by the following examiners:
External Examiners
a) From Abroad
1) Dr. Escalada Imaz, Gonzalo

Scientific Researcher
Research Council (CSIC) at the Artificial Intelligence Research
Institute (CSIC-IIIA)
Barcelona, Spain.
2) Dr. Muhammad Adeel Talib

Functional Architect,
Genix Ventures Pty Ltd,
Melbourne, Australia
3) Dr. Muhammad Tahir Abbas Khan

Associate Professor
Ritsumeikan Asia Pacific University,
1-1 Jumonjibaru, Beppu-shi, Oita 874-8577, Japan
b) From within the Country
Dr. Farooq Ahmad

Associate Professor
Faculty of Information Technology, University of Central Punjab
1 - Khayaban-e-Jinnah Road, Johar Town,
Lahore, Pakistan,
Internal Examiner
Dr. Muhammad Aslam

(Assistant Professor)
G. T. Road, Lahore, Pakistan
ABSTRACT
The rise of social networking sites and blogs has stimulated a bull market in personal opinion:
consumer recommendations, product reviews, ratings, and other types of online expression. For
computational linguistics researchers, this fast-growing heap of information has opened an
exciting research frontier, referred to as Sentiment Analysis (SA). For English, this area has
been under investigation for the last decade, but other major languages, like Urdu, have been totally
overlooked by the research community. Urdu is a morphologically rich and resource-poor
language. Its distinctive features, such as complex morphology, flexible grammar rules, context-sensitive
orthography and free word order, make Urdu language processing a challenging
problem domain. For the same reasons, sentiment analysis approaches and techniques developed
for other well-explored languages are not workable for Urdu text.
This dissertation presents a grammatically motivated sentiment classification framework to
handle these distinctive features of the Urdu language. The main research contributions are: to
highlight the linguistic (orthography, grammar, morphology, etc.) as well as technical
(parsing algorithm, lexicon, corpus, etc.) aspects of this multidimensional research problem; to
explore Urdu morphological operations, grammar and orthographic rules; and to redefine these
operations and rules with respect to the requirements of a sentiment analysis framework. The
orthographical, morphological, grammatical and, finally, conceptual details of the language
are our target concerns. Additionally, our approach can help in the sentiment analysis of other
languages, like Arabic, Persian, Hindi and Punjabi.
The proposed framework emphasizes the identification of SentiUnits rather than the
subjective words in the given text. SentiUnits are the sentiment-carrying expressions, which reveal
the inherent sentiments of the sentence for a specific target. The targets are the noun phrases about
which an opinion is expressed. The system extracts SentiUnits and target expressions through
shallow-parsing-based chunking. A dependency parsing algorithm creates associations between
these extracted expressions. The framework uses a sentiment-annotated-lexicon-based
approach. Each entry of the lexicon is marked with its orientation (positive or negative) and its
intensity (force of orientation) score. An experimental evaluation of the system, with a
sentiment-annotated lexicon of Urdu words and two corpora of reviews as test beds, shows
encouraging results in terms of accuracy, precision, recall and F-measure.
ACKNOWLEDGEMENTS
I believe the research work presented in this dissertation, from conception to completion,
is a blessing from my Allah, who answered my parents’ prayers and blessed me with
the strength. I also want to express my deepest gratitude to several individuals:
• First and foremost, my utmost gratitude to Dr. Muhammad Aslam, my supervisor,
whose support and encouragement I will never forget.
• Dr. Ana Maria Martinez-Enriquez, who guided me in writing good research papers
through her thoughtful comments and suggestions.
• Dr. Muhammad Ali Maud, Chairman of the Department of Computer Science and
Engineering, for his kind concern and consideration regarding my academic
requirements.
• My respected teachers during the PhD course work, for their guidance and
invaluable intellect.
• I am grateful to my colleagues and the staff in the Computer Science and Engineering
Department.
• Mr. Waqaar, who assisted me in the implementation and testing phase.
• Lastly, I would like to thank my family for all their love and encouragement. My
parents, for being excellent models of success and brilliance, who raised me with
a love of science and supported me in all my pursuits. My loving, supportive,
encouraging, and patient husband, Hasan, whose sincere support during all stages of
this Ph.D. gave me the feeling that I always had him on my side. Most of all, my
children Irtaza, Fatima and Ibrahim, for their patience and for tolerating my long study
hours.
Dedicated to my Parents, Husband and Children
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1. Research Motivation
1.2. Research contribution
1.3. The Problem of Sentiment Analysis
1.3.1. Targets of the appraisal
1.3.2. Sources of the appraisal
1.3.3. Appraisal expressions
1.3.4. Orientation
1.4. Sentiment annotated lexicon
1.5. Problem statement
1.6. System Evolution
1.7. Dissertation Outline
CHAPTER 2: STATE OF THE ART RESEARCH
2.1. Features of the given text
2.2. Techniques
2.3. Sentiment-annotated-lexicon construction
2.4. Generalization among domains
2.5. Processing Morphologically Rich Languages
2.6. Sentiment analysis and Urdu language processing
2.6.1. Word segmentation
2.6.2. Phrase Chunking
2.6.3. Stemming of complex morphology
2.6.4. Resources for Urdu language processing
2.6.5. Miscellaneous works
2.7. Adjective based sentiment analysis techniques
2.8. Term level vs. Phrase level polarity
2.8.1. Term-level-polarity based approaches
2.8.2. Phrase-level-polarity-based approaches
2.9. Negation Handling in sentiment analysis
CHAPTER 3: DISTINCTIVE FEATURES OF THE URDU LANGUAGE
3.1. Orthography
3.1.1. Character set
3.1.2. Word order
3.1.3. Bidirectional script
3.1.4. Ligatures
3.2. Parts of Speech
3.3. Vocabulary
3.4. Morphology
3.4.1. Inflection and derivation
3.4.2. Compounding
3.4.3. Reduplication
3.4.4. Compound verbs and verb phrases
3.5. Challenging features of the Urdu language
3.5.1. Corpus construction
3.5.2. Complex stemming
3.5.3. Intricate lexicon
3.5.4. Word boundary identification
3.5.5. Diacritics omission
3.5.6. Code switching
3.5.7. Independent case marking
3.5.8. Free word order
CHAPTER 4: SENTIUNITS: THE APPRAISAL EXPRESSIONS
4.1. Adjectives
4.1.1. Morphological structure of adjectives
4.1.2. Classes of adjective
4.2. Modifiers
4.3. Orientation
4.4. Intensity
4.5. Polarity
4.6. Negations
4.6.1. Negation in Urdu language
4.7. SentiUnit extraction model
4.8. The appraisal targets
4.8.1. Cases of noun phrases
4.8.2. Possession markers in noun phrases
4.8.3. Effect of complex noun phrases in Urdu text
CHAPTER 5: IMPLEMENTATION: CLASSIFICATION MODEL AND LEXICON STRUCTURE
5.1. PREPROCESSOR
5.1.1. Diacritic omission
5.1.2. Word boundary identification
5.2. EXTRACTOR
5.3. ASSOCIATOR
5.3.1. Working of the ASSOCIATOR
5.3.2. Algorithm
5.4. CLASSIFIER
5.4.1. Working of the CLASSIFIER
5.4.2. Algorithm
5.5. Computation of SentiUnit polarity: Effect of polarity shifters
5.5.1. Computing overall review polarity Rp from SUp
5.6. Sentiment Annotated Lexicon
5.6.1. Definitions of the specific terms
5.6.2. Sentiment annotated lexicon of Urdu words
5.7. System integration
CHAPTER 6: EXPERIMENTATION AND RESULTS
6.1. Lexicon Coverage
6.2. Corpus
6.3. Case Studies
6.3.1. CASE 1: Part of speech tagging
6.3.2. CASE 2: Extraction of targets and SentiUnits
6.3.3. CASE 3: Case marking and complex noun phrases
6.3.4. CASE 4: Polarity annotations
6.3.5. CASE 5: Associating targets with SentiUnits
6.4. Results
6.4.1. Model A
6.4.2. Model B
6.4.3. Effect of Negation
6.5. Example illustrations
CHAPTER 7: CONCLUSIONS AND FUTURE DIRECTIONS
REFERENCES
LIST OF TABLES
1.1 Summary of the given review in terms of sentiment analysis
2.1 Features used and their respective contributions
2.2 Techniques used by different contributions.
2.3 Lexicon construction research.
2.4 Corpuses and lexicons for Urdu language.
2.5 Urdu language processing.
2.6 Research contributions related to adjective based sentiment analysis.
2.7 Term-level polarity vs. phrase-level polarity approaches.
2.8 Negation handling for sentiment analysis.
3.1 Brief overview of Urdu language
3.2 Different shapes of a single alphabet ج (jeem).
3.3 Examples of Urdu words from multiple languages.
3.4 Examples of morphological processes in Urdu.
3.5 Inflection of multiple words from root word “علم” in the Urdu language.
3.6 Examples of affixes, case markers and postpositions.
3.7 Free word order property of the Urdu text.
4.1 Examples of opinionated sentences from Urdu with different SentiUnits.
4.2 Examples of unmarked adjectives.
4.3 Adjective marking with gender and number
4.4 Marking of adjectives for cases
4.5 Adjective agrees with the nearest noun in a sequence.
4.6 Adjective with partial and full reduplication.
4.7 Descriptive adjectives in Urdu.
4.8 Attributive adjectives directly modify the nouns.
4.9 Predicative adjectives describe the features of the nouns.
4.10 Examples of possessive adjectives.
4.11 Examples of demonstrative adjectives.
4.12 Inflection of demonstrative adjectives.
4.13 Examples of reflexive possessive adjectives.
4.14 Adjective modifiers.
4.15 Examples of sentential negation from Urdu text.
4.16 Examples of constituent negation from Urdu text.
4.17 Possession markers in Urdu.
5.1 Examples of lexicon entries.
6.1 Summary of lexicon entries.
6.2 Corpora for evaluation.
6.3 Parsing of example 1 into targets and SentiUnits.
6.4 Parsing of example 2 into targets and SentiUnits.
6.5 POS tagging and phrase chunking of the given review.
6.6 Experimental results in terms of P, R, F and A for model A.
6.7 Comparison of accuracy from both corpora C1 and C2 for model A.
6.8 Experimental results in terms of P, R, F and A for model B.
6.9 Comparison of accuracy from both corpora C1 and C2 for model B.
6.10 Effect of negation in terms of P, R and F.
LIST OF FIGURES
3.1 Character set of Urdu.
3.2 Diacritics in Urdu with letter “ب”.
4.1 Types of adjectives in Urdu.
4.2 SentiUnit extraction and polarity computation.
4.3 Cases of noun phrases with core case markers
5.1 System model representing modules and their interactions.
5.2 Preprocessing of the input sentence by the PREPROCESSOR module
5.3 Processing of the input sentence by EXTRACTOR module
5.4 The dependency parsing of the given sentence
5.5 Linking SentiUnits with candidate targets by ASSOCIATOR module
5.6 Sentiment classification of a review as positive or negative.
5.7 Computation of the overall polarity of the Urdu text based review.
5.8 Structure of the sentiment annotated lexicon
5.9 Integration of the lexicon of Urdu words with the sentiment classifier
6.1 Example extraction of the SentiUnits.
6.2 Linking the sentiment expressions with candidate targets.
6.3 Example of a positive review.
6.4 Result of the analysis.
6.5 Example of a negative review.
6.6 Result of the analysis.
LIST OF PUBLICATIONS
1. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Handling the Effect of
Polarity Shifters for a Morphologically Rich Language. In: An International
Interdisciplinary Journal in English, Japanese and Chinese. (Submitted)
2. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Associating Targets with
SentiUnits: A Step Forward in Sentiment Analysis of Urdu Text. In: Artificial
Intelligence Review.
3. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) Sentiment Analysis of
Urdu Language: Handling Phrase-Level Negation. In: Proceedings of the
10th Mexican international conference of artificial intelligence, pp 382–393
4. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) Adjectival Phrases as
the Sentiment Carriers in the Urdu Text. Journal of American Science 7(3), 644–652
5. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) Sentiment-Annotated
Lexicon Construction for an Urdu Text Based Sentiment Analyzer. In: Pakistan
Journal of Science (2011), ISSN: 0030-9877
6. Syed AZ, Muhammad A, Martínez-Enríquez AM (2010) Lexicon based sentiment
analysis of Urdu text using SentiUnits. In: Proceedings of the 9th Mexican
international conference of artificial intelligence, Pachuca, Mexico, pp 32–43
Chapter 1| Introduction 1
CHAPTER 1
INTRODUCTION
The information in the world can be generally categorized into two key types: facts and
opinions.
• Facts are objective expressions describing events, entities and their characteristic
properties.
• Opinions are typically subjective expressions or appraisal expressions that
describe personal or individual sentiments or appraisals about events, entities and
their characteristic properties.
The notion of opinion is very broad. For this dissertation, our focus is on the opinions
generated by individuals, particularly the appraisal expressions which are given in the
form of web reviews.
These appraisal expressions, in the form of opinions and subjective texts, are very
important. As Aristotle (384-322 BC) rightly said, "Man is by nature a social
animal": man always seeks suggestions, opinions, and views from other
people in society for his survival and for proper decisions in every walk of life.
There are many areas of textual natural language processing, such as information extraction,
summarization, information retrieval, text clustering, web search, and text categorization
or classification. Until recently, however, little research had been done on the analysis of
subjective texts for their inherent sentiments. One of the main reasons for this lack of study on
the automatic analysis of subjective text is that little opinionated text was
available before the World Wide Web. People used to seek opinions from their friends
and relatives before making any decision. Likewise, organizations used to conduct polls,
surveys and focus groups whenever they wanted to find the opinions or sentiments of
their clients. But in this modern era of computers and technology we are living in virtual
communities and societies. Now, internet forums, blogs, consumer reports, product
______________________________________________________________________________________
Redefining Urdu Morphology & Grammar for the Development of an Integrated Sentiment Analysis
Framework.
reviews, and other types of discussion groups have opened new horizons for the human mind.
That is why, from casting a vote to buying the latest gadget, people search for opinions and
reviews from other people on the internet. They can now post reviews of
products at business sites and express their views on almost anything in web forums,
blogs and discussion groups, which are collectively called user-generated content.
This is true not only for individuals but also for organizations and companies. An
organization or company may no longer need to organize focus groups,
conduct surveys, or employ external consultants in order to obtain client or consumer
opinions regarding its products and those of its competitors. Now, the user-generated
content on the Web can easily provide such information.
However, finding the right sources of subjective text and monitoring them on the
Web is still a difficult task, because there is a very large number of diverse sources, and
each source may also have a massive volume of such text. In many cases, the opinions
are hidden in lengthy forum posts and blogs. It is hard and very time consuming for a
human reader to find relevant sources, extract related sentences, read them, analyze
them, and organize them into a usable form. Thus, automated opinion or sentiment
discovery and summarization systems are needed. This need has given rise to an exciting and
rather new area in text analysis, which is referred to by many names, such as sentiment analysis,
opinion mining, subjectivity analysis, and appraisal extraction (Pang and Lee, 2008). In
this dissertation we use the term sentiment analysis.
1.1. Research Motivation

Two factors motivated us to pursue this research direction:
Factor I Rapid Proliferation of Information: Web 2.0 has emerged as a platform
for dynamic information exchange and the propagation of personal views. Now, more
and more people around the globe express their feelings through blogs, voice their views on
governmental and political affairs through news reviews, and record their likes and
dislikes in the form of product reviews. This proliferation of information has affected
the lives of internet users both positively and negatively.
On the one hand, people use internet forums, blogs, consumer reports, product reviews,
and different types of discussion groups for making everyday decisions. This text helps
them in almost all aspects of life, from medical care to business proposals and from home
education to professional training.
On the other hand, the negative aspect of this opinion sharing, which takes the form of
revolutionary or extremist propaganda, cannot be ignored. According to Glaser et al. (2002),
extremist groups use the Internet to promote hatred and aggression. The Internet has
turned into a ubiquitous, anonymous, economical, and rapid means of communication for
such groups (Crilley, 2001). Now, people discuss every type of emotional
behavior in web discussions and openly post their opinions. But this information can
mislead the general public in their beliefs and thoughts; children and youth
are particularly vulnerable.
Therefore, the analysis of user-generated web content is not only useful for commercial
purposes; the need for it in discouraging such misinformation is even more
immediate, particularly in the major languages of the world.
Consequently, research on opinion mining and sentiment analysis for some Indo-European
languages, like English, is flourishing and has produced a number of successful
contributions: (Turney 2002), (Pang et al. 2002), (Riloff et al. 2003), (Riloff and Wiebe
2003), (Tan et al. 2009) and (Bloom and Argamon 2010). These works use multiple
approaches and techniques, and most of them perform the task of sentiment analysis
very successfully. There
are now at least 20-30 companies that offer sentiment analysis services in the USA alone
(Liu, 2010).
Factor II Urdu as a Morphologically Rich Language: Although sentiment
analysis is a well-explored field for the English language, it is not yet clear whether
and how equivalent success can be attained for Morphologically Rich Languages
(MRLs) (Abdul-Mageed and Korayem 2010). MRLs are defined as languages
in which considerable information about the syntactic units and their relations is
expressed at the word level, i.e., the structures of the words are complex and morphological
operations like inflection and derivation are more frequent (Tsarfaty et al. 2010). Due to
this word-level complexity, MRLs are more challenging for computational
linguistics (CL) applications. It can result in intricate lexicons, complex stemming,
erroneous word segmentation, ambiguity in part-of-speech tagging, etc. Urdu is a
noteworthy case in point.
Challenges in Urdu Language Processing: Given that Urdu is a major language with
about 100 million speakers, there is great potential in performing sentiment analysis
on Urdu text. Because the Urdu language is morphologically rich, its constituent
words and phrases tend to be more complex, owing to recurrent derivations and
inflections. Besides the morphological complexity, variability in grammar rules
and vocabulary is usual in Urdu text and is considered acceptable. The main reason
for this phenomenon is that Urdu is influenced by many other languages, e.g., Hindi,
Persian, Arabic, Sanskrit and English, not only in vocabulary but also in morphology
and grammar. The loanwords from a particular language follow that language's own grammar
rules. Hence, the Urdu language is distinctive in its features and linguistic aspects.
Moreover, it is altogether different from the languages that are well studied in
sentiment analysis and other computational linguistics applications. Computational
linguistics researchers therefore require a comprehensive understanding of its linguistic as well as
computational aspects. Certain challenges that the Urdu language poses for
researchers are listed below and are explained in detail in Chapter 3.
• Optional use of diacritics causes misleading part-of-speech tagging.
• Cursive script results in wrong word boundary identification.
• Frequent inflection and derivation result in complex stemming.
• Word-level complexity makes lexicons more complex.
• Flexibility in vocabulary and grammar makes it difficult to define spelling and
grammar rules.
• The free word order property causes misidentification of part-of-speech tags.
1.2. Research contribution

On the basis of the two major factors of the research motivation, we state here the main
contributions:
• Sentiment analysis is a challenging computational linguistics or natural language
processing problem. Due to its remarkable significance for practical applications,
there has been overwhelming growth in both academic research
and commercial applications in industry. Unfortunately, to date there is
no significant contribution that addresses the problem of sentiment analysis for
the Urdu language. Our contribution is the first in this field.
• This research performs a deep analysis and survey of the idiosyncratic characteristics
of the Urdu language, the challenges posed by these characteristics, and their possible
effects on the language processing research performed so far, which makes it a worthy
reference for new researchers.
• The grammatically motivated model uses shallow-parsing-based chunking and very
successfully handles the challenging characteristics of the target language (as shown
by the results in Chapter 6).
1.3. The Problem of Sentiment Analysis

Although the Urdu language is the object of investigation in this research work, to
explain the problem of sentiment analysis we discuss a review written in English,
so that it is understandable to readers who are not native speakers of Urdu.
Definition 1: Liu (2010) defines sentiment analysis as the automatic or computational
analysis of opinions, emotions and sentiments expressed in user-generated content on the
Web.
Definition 2: Sentiment classification is the classification of the
given text as positive or negative according to its inherent sentiments.
Example: To establish the problem, we take a laptop review segment; all the sentences
are numbered for later referencing:
“(1) Last month I bought a laptop. (2) It is such a fine manufacture. (3) Its
processing speed is really amazing. (4) The operating system is fantastic too. (5)
Although the battery life is not long, that is acceptable for me. (6) However, mother
is not happy with me as I did not tell her before I bought it. (7) She also thinks the
laptop is too costly, and wants me to replace it with a cheaper one. … ”
The given text conveys a consumer’s (I) opinion about a product (laptop).
The consumer is the source of appraisal and the product is its main target. But when we look
at individual sentences, the targets of appraisal are different features of the main
target, and even the sources differ. There are quite a few opinions in this review.
Sentences (2), (3) and (4) express positive orientations of the inherent sentiments, while
sentences (5), (6) and (7) express negative ones.
All the appraisals have some targets, which mainly address the central or main target, i.e.,
the laptop. This main target is addressed indirectly through its features; in sentences
(3), (4), and (5) the target features are “processing speed”, “operating system” and
“battery life”, respectively. The expression in sentence (7) is about the cost of the laptop, but
the opinion/emotion in sentence (6) is about the consumer “me”, not the product. This is a
key point. In a review, the writer may be interested in opinions on various targets, but not
on all of them (e.g., unlikely on “me”). The source of the appraisals in sentences (2), (3),
(4) and (5) is the consumer himself, but in sentences (6) and (7) it is “mother”. Table 1.1
summarizes this discussion:
Sentence  Target            Source  Appraisal Expression  Orientation
(1)       None              None    None                  Objective
(2)       laptop            I       such a fine           Positive
(3)       processing speed  I       really amazing        Positive
(4)       operating system  I       fantastic             Positive
(5)       battery life      I       not long              Negative
(6)       me                mother  not happy             Negative
(7)       laptop            mother  too costly            Negative
Table 1.1 Summary of the given review in terms of sentiment analysis.

With this case in mind, we now formally define the sentiment analysis. We start with the
sentiment target.
1.3.1. Targets of the appraisal

In the sentiment analysis literature, the terms object, target, and entity are used to
represent the entity that has been commented on. We use the term target only. The
targets of the appraisals can be anything; in general, these can be products, services,
individuals or personalities, organizations or businesses, events or happenings,
discussion topics, etc.
A target can have a set of features f; these features represent the components or parts of
the target as well as its attributes or properties (Liu, 2010). For example, the laptop’s
features under consideration are “manufacture quality”, “processing speed”, “operating
system”, “battery life” and “cost”. Among these features, “manufacture quality”,
“processing speed” and “cost” are attributes, whereas “operating system” and
“battery life” are components.
Definition 3: In the given review, a target is an entity about which positive or negative
sentiments are expressed by the reviewer. This can be a person, product, topic, event, or
organization.
Example: In the above example, the target of the overall review is the laptop. At the sentence
level its features act as targets, but the opinions expressed about these features indirectly address
the main target. Sentence (6) diverts the analyzer to another target, “me”.
From this example we make the following two assumptions about the targets:
Assumptions:
1. The effects of the positive or negative appraisals of all the features of a main target in a
given review are combined to produce the final effect.
2. All other targets are discarded along with their appraisal orientations.
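The two assumptions can be illustrated on the rows of Table 1.1. The following is only a minimal sketch: the +1/-1 encoding of orientation, the unweighted sum, and the names `TABLE_1_1`, `FEATURES` and `combine_appraisals` are assumptions made for this illustration, not the polarity computation described in Chapter 5.

```python
# Rows of Table 1.1: (sentence, target, source, expression, orientation).
# Orientation is encoded as +1 / -1 (an assumption of this sketch);
# sentence (1) is objective and therefore omitted.
TABLE_1_1 = [
    (2, "laptop", "I", "such a fine", +1),
    (3, "processing speed", "I", "really amazing", +1),
    (4, "operating system", "I", "fantastic", +1),
    (5, "battery life", "I", "not long", -1),
    (6, "me", "mother", "not happy", -1),
    (7, "laptop", "mother", "too costly", -1),
]

MAIN_TARGET = "laptop"
# Hypothetical feature inventory of the main target, taken from the example text.
FEATURES = {"manufacture quality", "processing speed", "operating system",
            "battery life", "cost"}


def combine_appraisals(rows, main_target, features):
    """Assumption 2: drop rows about any other target.
    Assumption 1: combine the remaining orientations into one effect."""
    relevant = [r for r in rows if r[1] == main_target or r[1] in features]
    total = sum(r[4] for r in relevant)
    label = "positive" if total > 0 else "negative" if total < 0 else "neutral"
    return relevant, total, label


relevant, total, label = combine_appraisals(TABLE_1_1, MAIN_TARGET, FEATURES)
print([r[0] for r in relevant], total, label)  # [2, 3, 4, 5, 7] 1 positive
```

Sentence (6) is discarded because its target is “me”; the remaining five appraisals sum to a weakly positive overall effect under this unweighted encoding.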
Noun phrases as Targets: The targets of the appraisal are basically the non-overlapping
noun phrases in the given review. A noun phrase is a unit of one or more linked words
with a noun as the head word and all other words as its dependents. Hence, the algorithm
extracts targets with nouns as head words. For this purpose it uses shallow-parsing-based
chunking. See Chapter 4 and Chapter 5 for more explanation.
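As a rough illustration of chunking noun phrases from part-of-speech-tagged text, the sketch below groups runs of determiner, adjective and noun tokens and takes the last noun as the head. The tag names (`DET`, `ADJ`, `NOUN`), the head-final rule and the function `chunk_noun_phrases` are assumptions for this English example; the actual chunker and the Urdu tag set are described in Chapters 4 and 5.

```python
def chunk_noun_phrases(tagged):
    """Return non-overlapping noun phrases from a list of (word, tag) pairs.

    A phrase is a maximal run of DET/ADJ/NOUN tokens, trimmed so that it
    ends in a NOUN; that final noun is treated as the head word.
    """
    phrases, current = [], []
    # A sentinel token closes any run that reaches the end of the sentence.
    for word, tag in tagged + [("", "END")]:
        if tag in {"DET", "ADJ", "NOUN"}:
            current.append((word, tag))
            continue
        # The run has ended: drop trailing tokens that are not nouns.
        while current and current[-1][1] != "NOUN":
            current.pop()
        if current:
            words = [w for w, _ in current]
            phrases.append({"phrase": " ".join(words), "head": words[-1]})
        current = []
    return phrases


# Sentence (3) of the example review, tagged by hand for this sketch.
sentence_3 = [("Its", "DET"), ("processing", "NOUN"), ("speed", "NOUN"),
              ("is", "VERB"), ("really", "ADV"), ("amazing", "ADJ")]
targets = chunk_noun_phrases(sentence_3)
print(targets)
```

Here the only candidate target is “Its processing speed” with head “speed”; the trailing adjective “amazing” is not attached to any noun and is therefore left for the appraisal expression.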
1.3.2. Sources of the appraisal

The source of appraisal is also called the opinion source or opinion holder. In
product reviews and blogs, the source of appraisal is usually the reviewer or the author of
the post. In this case, the presence of other sources very rarely affects the final
classification. But in more complicated texts, like news articles, the source is
explicitly stated as a person or an organization that holds a particular opinion. For
example, the source of appraisal in the sentence “The president has disapproved the
political situation in the country” is “The president”.
Definition 4: In the given review, a source of appraisal is the person or
organization that expresses the opinion.
Example: For sentences (3), (4) and (5) the source of appraisal is the reviewer, the main
source. In sentences (6) and (7) a secondary source, “mother”, is introduced.
Assumptions:
• In the given review, only the main source is responsible for generating appraisal
expressions; opinions given by other sources are discarded.
1.3.3. Appraisal expressions

The appraisal expressions, also called opinions or subjective expressions, are mostly based on
adjectives. Some adjectives are general and can be used to modify the features of the
targets of appraisal.
Definition 5: An appraisal expression modifies a feature with a positive or negative view,
attitude, emotion or appraisal from a source of the appraisal.
Example: The expression “really amazing” modifies the feature “processing
speed” of the laptop.
Assumptions:
1. The appraisal expressions are always adjective based; they can consist of a single word
(only an adjective) or multiple words (an adjectival phrase).
2. In the given text, only those appraisal expressions are considered which are
associated with a specific target.
3. For a given review, only the appraisal expressions generated by the main source are
considered.
Appraisal Expressions as SentiUnits: In our approach (presented in the next chapters), we
label the appraisal expressions as SentiUnits. To extract the SentiUnits, the
algorithm first identifies the subjective words according to their orientation scores
(positive or negative). Then, it attaches the polarity shifters (words which shift the
polarity or orientation of the inherent sentiment; for more detail see Chapter 4),
conjunctions, postpositions and modifiers to extract the appraisal expressions from the
opinionated sentences. Shallow parsing based chunking is applied for the extraction
of the SentiUnits, with adjectives as the head words. The overall polarity of a sentence in
a given review can be determined by computing the polarity of these expressions. These
concepts are explained in detail in Chapters 4 and 5.
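The idea of scoring SentiUnits and combining them into a sentence polarity can be illustrated with a minimal sketch. The toy lexicon, the English glosses standing in for Urdu entries, the shifter list, and the function names are hypothetical and serve only as an illustration; the sketch treats every polarity shifter as a simple sign flip and does not perform the chunking step.

```python
# Hypothetical prior polarities; English glosses stand in for Urdu lexicon entries.
LEXICON = {"amazing": 1.0, "fine": 0.5, "long": 0.5, "happy": 1.0, "poor": -1.0}

# Assumed polarity shifters: only simple negation, which flips the sign.
SHIFTERS = {"not", "never"}


def sentiunit_polarity(tokens):
    """Return the polarity of one SentiUnit given as a list of tokens."""
    polarity = 0.0
    negated = False
    for tok in tokens:
        if tok in SHIFTERS:
            negated = not negated      # each shifter toggles the orientation
        elif tok in LEXICON:
            polarity += LEXICON[tok]   # subjective word: add its prior polarity
        # all other tokens are treated as objective and ignored
    return -polarity if negated else polarity


def sentence_polarity(sentiunits):
    """Combine the polarities of all SentiUnits found in one sentence."""
    return sum(sentiunit_polarity(unit) for unit in sentiunits)
```

Under these assumptions, `sentiunit_polarity(["not", "long"])` yields a negative score, in line with the negative orientation of “not long” in the running example.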
1.3.4. Orientation
The sentiment classification of the review starts at the word level. Each word is
classified as subjective or objective; each subjective word is then identified as positive
or negative. This positivity or negativity of the word or phrase is called its orientation.
Words with positive orientation exhibit positive sentiments or a supportive opinion.
The orientation can have a certain force or strength, called its intensity. For example, the
words “good” and “better” both have positive orientation, but the intensity of the latter
is greater.
Definition 6: The positivity or negativity of the appraisal expression for a specific target
or its feature is called its orientation.
Example: In the given example, the appraisal expressions with positive orientation are “such
a fine”, “really amazing” and “fantastic”, while “not long” and “not happy” exhibit
negative orientation.
1.4. Sentiment annotated lexicon

Our approach for sentiment analysis is lexicon based (entries are annotated with
orientation scores, represented as polarities). A usual model of a sentiment analyzer
incorporates two components: (i) the classification algorithm, which analyzes and
classifies the given opinionated text according to inherent sentiments of the reviewer, and
(ii) the lexicon or lexicons annotated with the prior polarities of the lexical entries
(words/ phrases), usually as positive or negative. These prior polarity annotated lexicons
are also called sentiment-annotated lexicons (Pang and Lee, 2008).
Model: At the highest level, our lexicon model categorizes all the lexical entries into
objective and subjective terms. Objective terms have no orientation or intensity and hence are not
marked with prior polarity scores. Therefore, they have no effect on the
overall decision of the classification. On the contrary, subjective terms are the carriers of
the sentiments and are marked with polarity scores. Their occurrence can affect or even
altogether alter the final classification decision. With respect to orientation and polarity,
the subjective terms are further categorized into three types (this model is explained in
detail in Chapter 5):
1. Absolute subjective terms with orientation only.
2. Subjective terms with intensity only.
3. Subjective terms with both values of orientation and intensity.
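The categorization above can be pictured as a small data structure. The following is only a sketch: the class name, the field names, the numeric encoding of orientation and intensity, and the example words are assumptions made for illustration and are not taken from the lexicon described in Chapter 5.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """One lexical entry; None means the value is not annotated (assumed encoding)."""
    word: str
    orientation: Optional[int] = None    # +1 for positive, -1 for negative
    intensity: Optional[float] = None    # strength of the sentiment

    def category(self) -> str:
        if self.orientation is None and self.intensity is None:
            return "objective"
        if self.intensity is None:
            return "subjective: orientation only"
        if self.orientation is None:
            return "subjective: intensity only"
        return "subjective: orientation and intensity"


# Hypothetical entries (English glosses used for readability).
entries = [
    LexiconEntry("table"),
    LexiconEntry("good", orientation=+1),
    LexiconEntry("very", intensity=1.5),
    LexiconEntry("better", orientation=+1, intensity=1.5),
]
categories = [e.category() for e in entries]
```

Only entries whose category is not "objective" would carry prior polarity scores and influence the classification decision.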
So far, we have explored the problem of sentiment analysis by giving examples
and basic definitions. Here, we define the research problem.
1.5. Problem statement

Let us denote the review under consideration, written in Urdu text, as $R$. $R$ consists of a single
sentence or of multiple sentences, among which some are subjective sentences (which
contain appraisal expressions, their targets and sources), forming the set
$S_s = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\}$, and others are objective sentences (without
appraisal expressions, their targets and sources), forming the set
$S_o = \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\}$, such that
$$R = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\} \cup \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\},$$
where
$k = 1, 2, 3, \ldots, n$;
$l = 1, 2, 3, \ldots, m$;
$n$ and $m$ are finite numbers.
AEXTRACTOR, and ASSOCIATOR modules of the system (presented in Chapter 5).

The final polarity of the review, $P_R$, is calculated by the CLASSIFIER as the sum of all
sentence polarities:
$$P_R = \sum_{i=1}^{N} P_{si},$$
where
$i = 1, 2, 3, \ldots, N$;
$N$ is a finite number.
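As a small worked illustration of this summation, the sketch below adds up sentence polarities and maps the total to a class label. The function names, the zero threshold, and the "neutral" label for a total of zero are assumptions for illustration; the source only specifies the sum itself.

```python
def review_polarity(sentence_polarities):
    """Polarity of a review as the sum of its sentence polarities."""
    return sum(sentence_polarities)


def classify_review(sentence_polarities):
    """Map the summed polarity to a label (thresholding at zero is an assumption)."""
    total = review_polarity(sentence_polarities)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Hypothetical sentence polarities of a three-sentence review.
example = [1.0, -0.5, 0.5]
total = review_polarity(example)
label = classify_review(example)
```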
Research goal: Hence, the goal of this research is to develop an integrated sentiment
analysis model for Urdu text. To achieve this goal, we formulate the following
objectives:
• To design and develop a sentiment-annotated Urdu lexicon, which includes
information about the subjectivity of an entry in addition to its orthographic,
phonological, syntactic, and morphological aspects. Unfortunately, no such
lexicon is available to date. Hence, from conception to modeling and
then implementation, we have to cope with this challenging task as a prerequisite for
the final system model (Syed et al. 2010), (Syed et al. 2012).
• To build an appropriate classification model (capable of handling the context
sensitive orthography, morphological operations and grammatical rules of the Urdu
language) for the processing and classification of the text in accordance with the
inherent sentiments. The algorithms applied to other languages like English (Pang
and Lee 2008), (Wiebe et al. 2004), (Bloom and Argamon 2010), Chinese (Jang and
Shin 2010), or Arabic (Abbasi et al. 2008) cannot be applied directly to Urdu, due to
its morphological complexity and the issues discussed in Chapter 3.
• To evaluate that model using the lexicon and compare the results. For the
experimentation, we use a sentiment-annotated lexicon of Urdu words and two corpora
of reviews about movies and electronic appliances as test beds.
1.6. System Evolution

The sentiment-annotated lexicon based classifier presented in our paper (Syed et al.
2010) focuses on:
Task 1: The extraction of the SentiUnits

Task 2: Computation of the polarity scores of the sentences according to the extracted
SentiUnits
Task 3: Classification of the review according to these polarity scores
This approach works well for sentences with single targets. In other words, it
can only handle simple opinions in which all the opinionated expressions are associated
with one object or target. The presence of multiple targets, as in comparative sentences
where two different targets are compared, may lead to a misclassification error, e.g., “It is
hard to rank 300 among the outstanding movies like Brave Heart, or Ben-Hur.” In this
case, the analyzer may misclassify the comment, because the expression “outstanding” is
positive and is by default associated with the movie “300”, which is under review.
This happens because the analyzer does not establish an expression-to-target link. The positive
expression “outstanding” should be linked with the movies “Brave Heart” and “Ben-Hur”
instead of the reviewed movie “300”.
To handle this kind of misclassification in complex sentences like comparatives, we
extend this model to introduce the concept of SentiUnit-to-target associations. In this
approach, we emphasize the exact identification of the SentiUnits as well as their
targets. To minimize the misclassification rate, these targets are associated with the
SentiUnits. For this purpose, we incorporate a new module called the ASSOCIATOR. The
EXTRACTOR module uses shallow parsing based chunking to extract the SentiUnits and
the targets (Syed et al. 2010). The ASSOCIATOR module uses a dependency parsing
based algorithm to associate each SentiUnit with its respective target (Syed et al. 2012).
After implementing the final version, we evaluate the system on the corpora of
reviews about movies and electronic appliances. We use four classification performance
metrics, i.e., precision, recall, and F-measure in addition to accuracy. Compared with
the previous versions, the results improve substantially, with an accuracy of 82.5%,
particularly for sentences with multiple targets.
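For reference, the four metrics named above can be computed from gold and predicted labels as in the following sketch. It uses the standard textbook definitions for a single positive class; the function name and the label strings are assumptions, and the sketch is not the evaluation code used in the experiments.

```python
def classification_metrics(gold, predicted, positive_label="positive"):
    """Precision, recall, F-measure for one class, plus overall accuracy."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive_label and p == positive_label)
    fp = sum(1 for g, p in pairs if g != positive_label and p == positive_label)
    fn = sum(1 for g, p in pairs if g == positive_label and p != positive_label)
    correct = sum(1 for g, p in pairs if g == p)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure as the harmonic mean of precision and recall.
    f_measure = (2 * precision * recall / (precision + recall)
                 if (precision + recall) else 0.0)
    accuracy = correct / len(pairs) if pairs else 0.0
    return {"precision": precision, "recall": recall,
            "f_measure": f_measure, "accuracy": accuracy}


# Hypothetical gold and predicted labels for four reviews.
gold = ["positive", "positive", "negative", "negative"]
pred = ["positive", "negative", "negative", "positive"]
metrics = classification_metrics(gold, pred)
```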
1.7. Dissertation Outline

The chapter wise division of this research dissertation is given below:
Chapter 2 gives a comprehensive overview of the state of the art research in the field of
sentiment analysis and Urdu language processing. It discusses features, approaches, and
techniques used for the development of the sentiment analyzer at different levels for
different languages.
A complete overview of the Urdu language, which is the main object of this research, is
given in Chapter 3. Because Urdu differs substantially from well-explored
languages like English, we explain its characteristic features, such as orthography,
morphology, syntax and grammar, in more detail to make the following chapters easier
to understand.
Chapter 4 describes the concept of SentiUnits, or appraisal expressions. Some
examples and their description support the explanation of the structure of the SentiUnits.
The overall system’s implementation, modules and their diagrams are given in Chapter 5.
This chapter also explores the construction, integration and model of the sentiment
annotated lexicon of Urdu words.
Chapter 6 presents the experimentation and results. For performance evaluation of the
sentiment analysis system, the experiments are performed on real corpora of user
reviews. For this purpose, review corpora are collected and sentiment-annotated
lexicons are developed.
Finally, Chapter 7 concludes our research contribution with some discussion points and
indications of the future endeavors.
Chapter review:
In this Chapter we defined some basic terminologies used in the task of sentiment
analysis. Using these terms we formulated our problem statement, stated the objectives,
goals and main contributions of the research.
CHAPTER 2
STATE OF THE ART RESEARCH
The field of sentiment analysis is the center of attention for researchers from
information retrieval, data mining, computational linguistics, and many other related
areas. Interest is growing rapidly, and the efforts so far have covered a broad
range of tasks, for example, polarity classification (Pang et al. 2002), (Turney 2002),
opinion identification (Pang and Lee 2004), and opinion source assignment (Breck et al.
2007), (Choi and Cardie, 2008). Additionally, these contributions have attempted the
problem at different granularity levels. For instance, (Pang et al. 2002)
attempts the sentiment classification task at the document level, (Pang and Lee 2004)
explores sentence level classification, while (Turney 2002) and (Choi and Cardie, 2008)
emphasize phrases. The literature survey given in this Chapter covers major aspects of
SA research and also gives a detailed overview of the contributions made to the language
processing of morphologically rich languages, with Urdu as a special focus.
To present a precise literature survey for SA and Urdu language processing, we focus
on the following major aspects:
1. Features of the given text
2. Techniques
3. Sentiment annotated lexicon construction
4. Generalization among domains
5. Processing of Morphologically Rich Languages
6. Urdu Language Processing
7. Adjective based SA techniques
8. Term level vs. phrase level polarity
9. Negation Handling in SA
2.1. Features of the given text
Researchers have focused on a number of features of the given text for achieving better
classification results. These features are encoded into feature vectors for the proper
application of machine learning algorithms (Pang and Lee 2008). Thus, feature selection
is a critical task and can affect the results to a great extent. Syntactic, semantic, linking
based, term based, topic oriented and part of speech based features are frequently used in
the literature. In the following, we discuss four categories: part of speech (POS) based,
term based, syntactic, and topic oriented features.
POS based information, particularly about adjectives, can help a lot in sentiment
analysis. That is why the earliest work in this domain uses adjectives as subjectivity
indicators (Hatzivassiloglou and McKeown 1997). After that, (Hatzivassiloglou and
Wiebe 2000; Mullen and Collier 2004), and (Whitelaw et al. 2005) present their
approaches to handle adjectives using multiple techniques. (Turney 2002) argues that
adverbs are also carriers of sentiments in a sentence and should be considered in
combination with adjectives. The sentences are divided into pre-structured grammatical
patterns, which include adjectives and adverbs as the core words. (Riloff et al. 2003)
attempts a relatively new idea and proposes the analysis of nouns in the text. It
emphasizes the concept of subjective nouns and computes the orientation for the
phrases in the sentence which contain them.
Many works are available in which term based features are considered. For example, the
position of the term in a sentence is put forward as a feature by (Kim and Hovy 2006).
This work locates the specific terms and then, according to their position, computes the
subjectivity orientation. Another work, (Wiebe et al. 2004), applies the concept of hapax
legomena, i.e., words occurring only once in a given corpus, for feature selection.
It proposes that the words that appear only once in the corpus are more subjective
than the others. In addition to this feature, it uses a relatively complex syntactic feature,
i.e., collocations of the words in a sentence. If some words or terms co-occur more
frequently than usual, then these are considered as collocations. According to (Yang et
al. 2006), the terms which are rare and are not entered in a pre-existing dictionary tend to be
more subjective, because the reviewers use them to emphasize their opinion.
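The notion of hapax legomena can be made concrete with a short sketch. The function name and the toy token list are illustrative assumptions; the sketch only finds the words occurring exactly once and does not reproduce the feature selection procedure of the cited work.

```python
from collections import Counter


def hapax_legomena(tokens):
    """Return the set of tokens that occur exactly once in the corpus."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n == 1}


# Hypothetical tokenized corpus (English used for readability).
corpus = "the movie was great the plot was bugfested".split()
hapaxes = hapax_legomena(corpus)
```

Here "the" and "was" occur twice and are excluded, while the remaining four words are hapax legomena.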
Table 2.1
Features used and their respective contributions.
Type | Focused features | Contributions
Term based | Term presence and position | Pang et al. (2002)
Term based | Bigrams and trigrams | Dave et al. (2003)
Term based | Hapax legomena | Wiebe et al. (2004)
Term based | Rare terms for emphasis | Yang et al. (2006)
Term based | Term position | Kim and Hovy (2006)
Term based | Term frequency | Abdul-Mageed, Korayem (2010)
Term based | Contrastive distance in terms | Kennedy, Inkpen (2006); Snyder, Barzilay (2007)
Syntax based | Collocations | Riloff, Wiebe (2003); Wiebe et al. (2004)
Syntax based | Appraisal expressions | Whitelaw et al. (2005)
Syntax based | Valence shifters | Kennedy, Inkpen (2006)
Syntax based | Noun adjective dependency | Bloom, Argamon (2010)
POS based | Adjectives | Hatzivassiloglou, McKeown (1997); Hatzivassiloglou, Wiebe (2000); Mullen, Collier (2004); Whitelaw et al. (2005)
POS based | Adjective and adverb | Turney (2002)
POS based | Subjective noun | Riloff et al. (2003)
Topic | Reference to the topic | Mullen, Collier (2004)
(Pang et al. 2002) reports better performance using term presence as a binary-valued
feature vector, whose entries merely specify whether a term occurs (1) or not (0). In a
term frequency feature vector, by contrast, the entry values increase with the occurrence
frequency of the corresponding term (Abdul-Mageed and Korayem 2010). Bigrams and trigrams are used
by (Dave et al. 2003). (Kennedy and Inkpen 2006) and (Snyder and Barzilay 2007)
consider the contrastive distance between terms as an automatically computed feature.
(Whitelaw et al. 2005) uses the concept of appraisal theory and extracts appraisal
expressions with the help of a sentiment lexicon. (Mullen and Collier 2004) observes that
the sentences which contain a reference to the topic can be considered more important.
For this purpose, it specifies words and phrases which can be extracted as
indicators of the reference. The features discussed above and the related contributions, with
some further examples, are summarized in Table 2.1.
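The difference between a term presence vector and a term frequency vector can be shown with a minimal sketch. The function names, the vocabulary, and the toy review are assumptions chosen for illustration.

```python
from collections import Counter


def presence_vector(tokens, vocabulary):
    """Binary vector: 1 if the term occurs in the text, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]


def frequency_vector(tokens, vocabulary):
    """Count vector: number of occurrences of each vocabulary term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]


# Hypothetical vocabulary and tokenized review.
vocabulary = ["good", "bad", "movie"]
review = ["good", "good", "movie"]
presence = presence_vector(review, vocabulary)
frequency = frequency_vector(review, vocabulary)
```

The two vectors differ only for terms that occur more than once, here "good".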
2.2. Techniques
A number of techniques are used for sentiment analysis, e.g., unsupervised
bootstrapping, sentiment lexicons and support vector machines (see Table 2.2). In the
unsupervised bootstrapping approach, a primary or initial classifier is applied to the text to
generate labeled data as the output. After that, a supervised learning algorithm may be
applied to this data. The initial classifier can have various implementation possibilities,
according to the language complexity and the depth of the required analysis. An example of
such an initial high-precision classifier, used to learn extraction patterns for subjective terms, is
proposed by (Riloff and Wiebe 2003). (Kaji and Kitsuregawa 2007) uses this method for
the automatic construction of a corpus from HTML documents, in which polarity
labels are assigned to the entries.
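The two-stage idea behind bootstrapping can be sketched as follows. Everything in this example is an assumption made for illustration: the seed word lists, the function names, and in particular the second stage, where a simple word-count learner stands in for whatever supervised algorithm is actually used. It is not the method of any of the cited works.

```python
from collections import Counter

# Hypothetical seed clues for a high-precision initial classifier.
SEED_POSITIVE = {"excellent", "amazing"}
SEED_NEGATIVE = {"terrible", "awful"}


def seed_classifier(tokens):
    """Label a sentence only when seed clues of exactly one polarity occur."""
    pos = any(t in SEED_POSITIVE for t in tokens)
    neg = any(t in SEED_NEGATIVE for t in tokens)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None  # abstain: keeps precision high at the cost of recall


def bootstrap(sentences):
    """Stage 1: auto-label data. Stage 2: learn word counts per label."""
    labeled = [(s, seed_classifier(s)) for s in sentences]
    labeled = [(s, y) for s, y in labeled if y is not None]
    word_counts = {"positive": Counter(), "negative": Counter()}
    for tokens, label in labeled:
        word_counts[label].update(tokens)
    return labeled, word_counts


def learned_classifier(tokens, word_counts):
    """Classify using the word statistics learned from the auto-labeled data."""
    pos = sum(word_counts["positive"][t] for t in tokens)
    neg = sum(word_counts["negative"][t] for t in tokens)
    if pos > neg:
        return "positive"
    if neg > pos:
        return "negative"
    return None


sentences = [["excellent", "acting"], ["terrible", "plot"], ["nice", "acting"]]
labeled, word_counts = bootstrap(sentences)
```

In this toy run the seed classifier abstains on the third sentence, while the learned classifier can label it because "acting" co-occurred with a positive seed.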
(Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou 2003; Riloff
et al. 2003) and (Higashinaka et al. 2007) employ the sentiment-annotated lexicon induction
technique. As a first step, an unsupervised approach is applied to generate a
sentiment-annotated lexicon. Then, using this lexicon as a resource, the given text is classified as
positive or negative.
(Hu and Liu 2004) and (Andreevskaia and Bergler 2006) use Princeton WordNet for the
extraction of sentiment tags. There is also a trend in the research community to extend
existing lexicons; e.g., SentiWordNet is an extension of WordNet.
Table 2.2.
Techniques used by different contributions.
Technique used | Contributions
Unsupervised bootstrapping | Riloff, Wiebe (2003); Kaji, Kitsuregawa (2007)
Sentiment annotated lexicon | Hatzivassiloglou, Wiebe (2000); Turney (2002); Yu, Hatzivassiloglou (2003); Riloff et al. (2003); Higashinaka et al. (2007)
Support vector machines (SVM) | Pang, Lee (2002); Dave et al. (2003); Pang, Lee (2004); Kennedy, Inkpen (2006)
WordNet based | Hu, Liu (2004); Andreevskaia, Bergler (2006)
2.3. Sentiment-annotated-lexicon construction
Because we use a lexicon based approach for the development of the sentiment
analyzer, we discuss here some contributions to this aspect of the research. Lexicon
construction with adequate coverage is a challenging task. From the definition of grammar
rules to their appropriate implementation, it requires much expertise and proficiency
in the target language as well as in computer algorithms. For the task of sentiment
analysis the entries of these lexicons are annotated with the orientation scores in addition
to their morphological, grammatical and phonological information. This sentiment
annotation task can be done manually, with the help of the agreement of judges who
decide on the orientation scores of the given words, or it can be done
automatically, using computer algorithms such as machine learning approaches. Manual
annotation provides higher accuracy but is more time consuming.
Languages that are more popular on the internet have rich and easily available
electronic linguistic resources. For English, for example, almost all types
of corpora are available from almost all domains, from product reviews to news
discussions. That is why the sentiment analysis research community has moved towards
algorithms and approaches that generate lexicons automatically as
an alternative to manual annotation and tagging, for example, (Annett and Kondrak
2008; Higashinaka et al. 2007; Andreevskaia and Bergler 2006; Hu and Liu 2004; Yu and
Hatzivassiloglou 2003; Riloff et al. 2003; Turney 2002) and (Hatzivassiloglou and Wiebe
2000). These methods are fast and can rapidly develop domain dependent lexicons.
Going back to the history of sentiment-annotated lexicon construction, the General Inquirer
(Stone et al. 1966) is a popular, manually compiled resource for sentiment analysis of the
English language. A pioneering attempt at the automatic acquisition of a sentiment-annotated
lexicon is (Hatzivassiloglou and McKeown 1997). This work develops a
sentiment-annotated lexicon with an emphasis on adjectives. It applies a shallow parsing
algorithm and develops a log-linear statistical model, which predicts whether any two
adjectives have the same orientation. After that, the automatic acquisition of the polarity
values of words and phrases itself emerged as an active line of research. Diverse
techniques have been proposed and implemented for learning word polarities. These
include corpus-based approaches like (Hatzivassiloglou and McKeown 1997), statistical
approaches based on measures of word association as proposed in (Turney and Littman
2003), and approaches using lexical relationships (Kamps et al. 2004).
Some efforts have tried to use or extend existing lexicons; e.g., SentiWordNet is an
extension of WordNet. In SentiWordNet the polarity marks are annotated with the
existing structure of the gloss. (Annett and Kondrak 2008), (Andreevskaia and Bergler
2006) and (Hu and Liu 2004) utilize WordNet or its extensions for the sentiment analysis.
Moreover, (Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou
2003; Riloff et al. 2003) and (Higashinaka et al. 2007) have tried to develop algorithms
and techniques for automatic lexicon construction using unsupervised learning methods.
All these discussed contributions are summarized in Table 2.3.
Most of these efforts use pre-developed linguistic resources like corpora for the
development and extraction of the required lexicons. Urdu, however, is a resource-poor language,
and hence the task of lexicon construction becomes more difficult and time consuming.
To our knowledge, no such lexicon exists for Urdu text. However, a few
efforts have tried to construct corpora and simple lexicons for other NLP
applications.
Table 2.3.
Lexicon construction research.
Research focus | Contributions
Manually compiled | Stone et al. (1966)
Corpus based | Hatzivassiloglou and McKeown (1997); Turney and Littman (2003); Kamps et al. (2004)
Extension of existing lexicons | Annett and Kondrak (2008); Andreevskaia and Bergler (2006); Hu and Liu (2004)
Unsupervised learning methods | Hatzivassiloglou and Wiebe (2000); Turney (2002); Yu and Hatzivassiloglou (2003); Higashinaka et al. (2007)
Preliminary work is presented by the EMILLE (Enabling Minority Language
Engineering) project in the form of a multi-lingual corpus for the South Asian languages.
Its parallel corpus for the Hindi, Urdu, English, Bengali, Punjabi and Gujarati languages
contains about 200,000 words (Baker et al. 2003). Its independent corpus of Urdu text
has 1,640,000 words annotated with POS tags (Hardie 2003).
Another effort is presented in (Ijaz and Hussain 2007). They use a corpus to automatically
develop an Urdu lexicon. Their corpus is based on cleaned text from news websites,
containing about 18 million words. The work (Muaz et al. 2009) gives a brief analysis of
the parts of speech of the Urdu language and develops a POS tagged corpus, whereas another
effort (Mukund et al. 2010) generates a semantic role labeled corpus for Urdu text using
cross lingual projections. (Humanyoun et al. 2007) presents the automatic extraction
of an Urdu lexicon from a corpus. Table 2.4 shows
the corpora and lexicons developed for the Urdu language for different applications of NLP.
Table 2.4.
Corpora and lexicons for Urdu language.
POS tagged corpora | Hardie (2003); Muaz et al. (2009)
Corpus based lexicon construction | Ijaz and Hussain (2007); Humanyoun et al. (2007)
Semantic role labeled corpus | Mukund et al. (2010)
2.4. Generalization among domains
The generalization of sentiment analysis solutions among multiple domains is still an
open issue. The term domain adaptation is coined by the SA community (Tan et al. 2009)
to refer to the development of a generalized solution which can be applied to all the
potential target domains. Most of the contributions for opinion mining are highly domain
specific (Pang and Lee 2004). (Tan et al. 2009) handles the domain adaptation issue using
the frequently co-occurring entropy (FCE) method. It emphasizes a smooth transformation
from a domain d1 to another domain d2 through a set of generic features F, representing
d1 and d2. It evaluates the model for six domains and finally concludes that FCE is not the
best option. Another feature related to multiple domains is their complexity level.
Sentiment analysis of reviews related to products and movies is considered the easiest
in the literature (Pang and Lee 2004), and these reviews serve as a test bed for most of the
approaches. On the contrary, political speeches and discussions are perhaps the most
complex to handle. (Bansal et al. 2008) pinpoints an issue and evaluates whether the
speech is in favor of or in opposition to it.
2.5. Processing Morphologically Rich Languages
Morphologically rich languages (MRLs) are a challenging domain for NLP
researchers. Still, there are a number of noteworthy contributions. For example, a
stemming model for the classical Arabic of the Holy Quran is presented by (Thabet 2004). This
work uses a stop-word list and makes lists of words from every surah. The two lists are
compared, and when a word in the created list does not exist in the stop-word list,
the algorithm removes its prefixes. The accuracy of the algorithm is 99.6% for
prefix stemming and 97% for postfix stemming. (Paik and Parui 2008) presents a general
analysis of the languages spoken in India, particularly Marathi, Hindi, and Bengali. In
this work, the lexical entries are grouped into similarity classes by matching their
prefixes up to a predefined length. Another stemmer,
for the Hindi language, is proposed by (Kumar and Siddiqui 2008); it computes n-grams
of the words with the given length. The algorithm treats these n-grams as the postfixes
and extracts the possible stems with postfixes. Finally, the combination of postfix and
stem with maximum probability is picked, with a reported accuracy of 89.9%. A Telugu
language based stemmer in (Akram et al. 2009) presents statistical techniques and
suggests that this MRL requires deeper linguistic analysis for improved results.
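Grouping lexical entries by a common prefix of predefined length can be illustrated with a short sketch. The function name, the prefix length, and the romanized toy words are assumptions for illustration only; the sketch shows the general idea of prefix-based classes and is not the algorithm of the cited work.

```python
from collections import defaultdict


def prefix_classes(words, prefix_len):
    """Group words that share the same first `prefix_len` characters."""
    classes = defaultdict(list)
    for word in words:
        # Words shorter than prefix_len form a class keyed by the whole word.
        classes[word[:prefix_len]].append(word)
    return dict(classes)


# Hypothetical romanized word list.
words = ["larka", "larki", "larkon", "kitab", "kitaben"]
classes = prefix_classes(words, prefix_len=4)
```

Each resulting class collects candidate variants of one stem; the choice of prefix length controls how coarse the classes are.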
Orthographically and grammatically, the Urdu and Persian languages have many similarities,
partly because they share a large vocabulary. (Sharifloo and Shamsfard
2008) present a rule-based bottom-up algorithm for the stemming of Persian text. The
algorithm first extracts the core substrings of the words and compares them with already
defined cores using some grammar rules. This matching of the strings is done using
already defined morpheme clusters. Moreover, the accuracy is enhanced to about 90.1%
by applying an anti-rule procedure.
There are some noteworthy contributions for handling sentiment analysis in MRLs,
for example, (Abdul-Mageed and Korayem 2010) and (Abbasi et al. 2008) for Arabic,
and (Jang and Shin 2010) for the Chinese language. The work presented in
(Abdul-Mageed and Korayem 2010) addresses sentiment analysis of Arabic text. Its
main focus is on Arabic text related issues for the development of a practical analyzer
with acceptable performance. It analyzes news text by automatic classification at the
sentence level, applying a support vector machines classifier. Another related work is
(Abbasi et al. 2008). It performs sentiment analysis of Arabic and English web forums,
with an emphasis on extremist opinion propagation. For handling the Arabic language’s
characteristics, it proposes specific feature extraction components. It develops the Entropy
Weighted Genetic Algorithm (EWGA), a hybridized genetic algorithm that incorporates
the information gain heuristic for feature selection, i.e., stylistic and syntactic features.
This algorithm improves the system performance by selecting better key features.
2.6. Sentiment analysis and Urdu language processing
Due to the idiosyncratic linguistic features of the Urdu language and its distinct set of
morphological and grammatical rules, computer based processing of Urdu is not a very
well explored area. Our own contributions address the subjectivity or sentiment
analysis of Urdu text (Syed et al. 2010; Syed et al. 2011; Syed et al. 2012).
In this section, we present a brief survey of the major NLP contributions for the Urdu
language which are useful for sentiment analysis.
2.6.1. Word segmentation

For all computational linguistics applications, the first task is to segment the sentence into
words, i.e., to identify the word boundaries accurately. Due to the cursive and
context sensitive orthography of the Urdu language, word segmentation is not as
trivial as for English or French, where the word boundaries are identified through white
spaces. (Durrani and Hussain 2010) identify this segmentation issue as a major problem
in the accurate processing of the text, discuss its inherent causes in detail,
and conclude by providing a word segmentation
model. (Lehal 2009) and (Lehal 2010) also provide algorithms in this regard.
2.6.2. Phrase Chunking
A variety of phrases exist in Urdu including verb phrases, noun phrases, and adjectival
phrases. Identifying these phrases in a sentence is very helpful for various applications in
NLP, like, information retrieval or extraction, parsing, sentiment analysis, machine
translation, and question answering. The procedure which directly tags these phrases is
called phrase chunking or simply chunking. For Urdu phrase chunking, a very prominent
contribution is (Ali and Hussain 2010), which describes the structure of Urdu verb
phrases and conducts a series of experiments to label them automatically. It uses a
manually tagged corpus of 100,000 Urdu words with verb phrase chunk tags and a
hybrid approach with an extended tag set. The reported results of this effort give
98.44% accuracy.
Because Urdu and Hindi are morphologically very similar, we discuss two contributions on
phrase chunking for Hindi text (Singh et al. 2005; Dalal et al. 2006). In the former, an
HMM based chunk tagger is presented for the Hindi language. The chunk tagging is divided
into two sub tasks: the identification of the chunk boundaries and then the labeling of the
chunks according to their types. In (Dalal et al. 2006), the Hindi tagger uses a statistical
approach based on maximum entropy. Various features are used simultaneously for the
prediction of the word tags. The proposed feature set is broadly classified into
dictionary and context-based features, word features, and corpus based features. A corpus
of more than 35,000 words/phrases is used for testing and training, reporting an accuracy
of 87.4%.
2.6.3. Stemming of complex morphology
Urdu is rich in both derivational and inflectional morphology. For example, the verbs
inflect to agree with case, number, respect, and gender. The verbs are also inflected for
mood (e.g., imperative, infinitive), tense (e.g., present, past), and the habitual aspect. (Akram et al.
2009) states that the verbs alone have sixty inflected variations in Urdu. Moreover, the
adjectives also show agreement for case, number, and gender. (Syed et al. 2012)
describes this phenomenon in detail. The intense inflectional and derivational behavior of
Urdu makes the stemming of Urdu text quite challenging, because a
stemmer becomes harder to devise as the character encoding, morphology, and script of
the language become more intricate. For example, the Italian language has more inflections
than English, so its stemming is more complex.
Arabic is also an MRL, so the stemming task becomes even harder. (Riaz 2010) suggests
that the Arabic and Farsi stemming processes cannot be used for Urdu, because the
differing inflections lead to erroneous results. Besides, the dictionary- or lexicon-based
error-correcting schemes used by other stemmers cannot be applied to Urdu because of
the dearth of machine-readable resources. The Urdu stemmer of (Akram et al. 2009)
focuses on a rule-based approach, which removes the prefix and the postfix before adding
a letter or letters to generate the surface form from the stem. Exception lists are created
and used to complete the first two steps of the algorithm. If the lookup is successful, the
stripping process is bypassed. (Riaz 2010) describes the challenges of Urdu stemming and
proposes a rule-based model with a few rules implemented to simulate the intricacies.
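The general shape of such a rule-based stemmer (exception lookup first, then prefix
stripping, postfix stripping, and letter addition) can be sketched as follows. This is a
minimal illustration only, not the published rule set of (Akram et al. 2009): the affix
lists, the exception entry, the minimum-length guard, and the function name stem are all
assumptions.

```python
# Hypothetical resources; a real stemmer would use much larger lists.
EXCEPTIONS = {"الفاظ": "لفظ"}            # irregular (Arabic) plural -> stem
PREFIXES = ("نا", "بے")                  # na-, bay-
SUFFIX_RULES = (("وں", ""), ("ے", "ا"))  # (postfix to strip, letters to add)


def stem(word):
    """Return a stem for one word using exception lookup and affix stripping."""
    # Step 0: a successful exception lookup bypasses the stripping process.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Step 1: strip one prefix, keeping at least three letters of the word.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) > len(prefix) + 2:
            word = word[len(prefix):]
            break
    # Step 2: strip one postfix, then add letters to restore the stem.
    for suffix, added in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + added
    return word


stems = [stem(w) for w in ("الفاظ", "پھولوں", "ناممکن", "پودے")]
```

The length guard is one simple way to avoid stripping a prefix from a short word that
merely begins with the same letters; the exception list serves the same purpose for
irregular forms.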
2.6.4. Resources for Urdu language processing
For Indo-Aryan languages like Urdu, only a few lexical resources are available and
accessible for research. For Hindi, for example, a lexical resource modeled on the English
WordNet is presented as Hindi WordNet (Bhattacharyya et al. 2008; Bhattacharyya 2010).
The methodology and architecture of this resource are based on the English WordNet
(Fellbaum 1998). Urdu WordNet (Ahmed and Hautli 2010) is developed using the same
approach. As far as corpus construction is concerned, the Enabling Minority Language
Engineering project is a considerable effort; it works on a multilingual corpus for the
South Asian languages. An independent part-of-speech-tagged corpus of about 1,640,000
Urdu words is developed in (Hardie 2003). Another part-of-speech-tagged corpus is
presented by (Muaz and Hussain 2009). (Humayoun et al. 2007) presents the automatic
extraction of an Urdu lexicon from a corpus. (Ijaz and Hussain 2007) also presents the
development of an Urdu lexicon from a corpus; the corpus is based on cleaned text from
Urdu news websites and has nearly 18 million words.
(Hautli and Butt 2011) describes a computational semantic analyzer developed as part of
the parallel grammar (ParGram) project; it is based on the syntactic analysis done for the
Urdu grammar component of ParGram. In addition to the semantic construction, some
peripheral lexical resources, such as a preliminary Urdu WordNet and a VerbNet, are
developed and integrated with the main model. Such resources help to generate a more
comprehensive representation of lexical knowledge, e.g., hyponyms for words and their
thematic roles.
2.6.5. Miscellaneous works
There are some other contributions in Urdu NLP worth mentioning. For example,
(Mukund et al. 2010) employs cross-lingual projections in the PropBank paradigm for the
automatic induction of semantic role annotations for Urdu text. These annotations are
made on the basis of word alignments. The projection model uses an Urdu-English
parallel corpus to exploit syntactic as well as lexical information. The reported accuracy
of the annotations is 92% on short sentences.
Table 2.5.
Urdu language processing.
Task                 Contributions
Word Segmentation    Durrani and Hussain (2010)
                     Lehal (2010)
                     Lehal (2009)
Phrase Chunking      Ali and Hussain (2010)
                     Singh et al. (2005)
                     Dalal et al. (2006)
Stemming             Akram et al. (2009)
                     Riaz (2010)
Resources            Bhattacharyya et al. (2008)
                     Bhattacharyya (2010)
                     Fellbaum (1998)
                     Ahmed and Hautli (2010)
                     Hardie (2003)
                     Hautli and Butt (2011)
                     Muaz and Hussain (2009)
                     Humayoun et al. (2007)
                     Ijaz and Hussain (2007)
Analysis             Hautli and Butt (2011)
                     Mukund et al. (2010)
                     Rizvi and Hussain (2005)
                     Mukund and Ghosh (2011)
                     Mukund and Srihari (2010)
(Rizvi and Hussain 2005) describes a computational investigation of different Urdu parts
of speech. The work is largely theoretical, so it can be used to define and implement rules
for many language processing applications.
(Mukund and Ghosh 2011) describes the automatic extraction of opinion-holder words
and phrases from Urdu text. It refers to the opinion holders and their targets together as
opinion entities. It works in two steps: it first generates the word sequences related to
the opinion entities and then disambiguates these extracted sequences as the holders or
targets of the opinions. Morphological operations such as inflections are used to identify
the sequence boundaries for verbs and nouns correctly. Another work, (Mukund and
Srihari 2010), addresses the classification of objective and subjective sentences and
employs a vector space model.
2.7. Adjective based sentiment analysis techniques
As mentioned in Section 2.1, the part-of-speech-based features of a text, particularly
adjectives, can help considerably in sentiment analysis. Here, we focus on the
adjective-based approaches used by the NLP community. One of the earliest works in this
domain (Hatzivassiloglou & McKeown, 1997) uses adjectives as subjectivity indicators. It
employs a log-linear regression model to identify and validate the positive or negative
semantic orientation of conjoined adjectives. A clustering algorithm divides the adjectives
into groups with respect to orientation and labels them as positive or negative. Before
that, (Hatzivassiloglou & McKeown, 1993) presents an approach for the automatic
recognition of adjectival scales. This approach groups, or clusters, adjectives carrying the
same semantics, but not from the perspective of sentiment analysis. (Bruce & Wiebe,
2000) recognizes subjectivity within text by manual tagging. As a case study of
sentence-level categorization, it categorizes clauses from the “Wall Street Journal” as
objective or subjective. Each clause is given a final classification on the basis of a decision
agreed by four judges.
(Hatzivassiloglou & Wiebe, 2000) analyzes two main features of adjectives for
subjectivity prediction, i.e., gradability and semantic orientation. It extracts the
gradability values using an automatic method. (Turney, 2002) suggests that adverbs are
also carriers of sentiment in a sentence and should be considered in combination with
adjectives. In that work, the sentences are divided into pre-structured grammatical
patterns that have adjectives and adverbs as the core words. (Riloff et al., 2003)
emphasizes the identification of subjective nouns, which are modified by adjectives, and
computes the orientation of the phrases in the sentence that contain them. (Riloff &
Wiebe, 2003) uses an unsupervised learning method for the automatic extraction and
learning of patterns for subjective expressions in the given text.
Table 2.6.
Research contributions related to adjective based sentiment analysis.
Focus                      Contributions
Adjectives                 Hatzivassiloglou and McKeown (1993)
                           Hatzivassiloglou and McKeown (1997)
                           Bruce and Wiebe (2000)
                           Hatzivassiloglou and Wiebe (2000)
Adjectives and adverbs     Turney (2002)
Subjective nouns           Riloff et al. (2003)
Subjective expressions     Riloff and Wiebe (2003)
Appraisal expressions      Whitelaw et al. (2005)
                           Bloom and Argamon (2010)
(Whitelaw et al., 2005) proposes the use of appraisal theory for sentiment analysis. It
works on the extraction of appraisal expressions, which are sentiment-oriented phrases
that contain adjectives as head words. (Bloom & Argamon, 2010) extends this model and
proposes an approach for the automatic learning of these appraisal expressions. Research
contributions related to adjective based sentiment analysis are shown in Table 2.6.
2.8. Term level vs. Phrase level polarity
According to a survey of the psychological aspects of natural language [55], only 4% of
the total words used in written texts carry sentimental or affective content. This means
that the sentiment of a sentence cannot be obtained by analyzing only this 4% of the
content; in addition to the subjective terms, we need to explore the other words and
phrases that together make up the final sentiment of the given text. In this regard, the
existing works on automatic sentiment classification fall principally into two categories,
i.e., word-level classification and phrase-level classification. Word-level classification
determines the polarity orientations of individual words, called prior polarities.
Phrase-level classification uses these prior polarities to calculate the phrase-level
polarities.
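The relationship between the two levels can be sketched in a few lines. The lexicon
entries, the numeric scores, and the choice of simple summation as the combination rule
are assumptions made for illustration; the approaches surveyed below use considerably
richer models.

```python
# Hypothetical prior-polarity lexicon (term level); the entries are assumptions.
PRIOR_POLARITY = {"good": 1, "excellent": 2, "bad": -1, "terrible": -2}


def term_polarities(tokens):
    """Term level: look up the prior polarity of each token (0 if unknown)."""
    return [PRIOR_POLARITY.get(tok, 0) for tok in tokens]


def phrase_polarity(tokens):
    """Phrase level: combine the prior polarities of the terms into one label."""
    score = sum(term_polarities(tokens))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


label = phrase_polarity("a good but terrible ending".split())
```

Most tokens receive a prior polarity of zero, which reflects the observation above that
only a small share of words carry affective content.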
2.8.1. Term-level-polarity based approaches

In the early contributions, the approaches concentrated mainly on determining the prior
polarities of the constituent terms. The first effort at automatic sentiment analysis,
(Hatzivassiloglou & McKeown, 1997), considered adjectives as the polar terms and
presented a scheme based on the conjunctions between adjectives in a large corpus. It
incorporated a shallow parsing algorithm for text chunking and constructed a log-linear
statistical model that predicts whether any two conjoined adjectives have the same
orientation. Before that, (Hatzivassiloglou & McKeown, 1993) focused on the automatic
recognition of adjectival scales by grouping, or clustering, adjectives carrying the same
semantics, although this work was not carried out with sentiment analysis in view.
After this contribution (Hatzivassiloglou & McKeown, 1997), two lines of research
appeared in the sentiment analysis community. The first focused on adjectives as the
sentiment carriers. For example, (Hatzivassiloglou & Wiebe, 2000) evaluates two key
features of adjectives as subjectivity indicators, i.e., semantic orientation and
gradability, and (Bruce & Wiebe, 2000) recognizes bias within text by manual tagging.
2.8.2. Phrase-level-polarity-based approaches

Then, the focus of research moved from word-level prior polarity to phrase-level, or
expression-level, polarity analysis. (Riloff et al., 2003) stresses the subjective nouns
modified by adjectives and computes the orientation of the phrases containing these
nouns, whereas (Turney, 2002) recommends that adverbs are also carriers of sentiment in
an opinion and should be considered along with adjectives. In that work, the sentences
are converted into pre-structured grammatical patterns with adjectives and adverbs as
the core terms. (Riloff & Wiebe, 2003) emphasizes the subjective expressions in the given
texts. Another work, (Whitelaw et al., 2005), proposes the concept of appraisal
expressions based on appraisal theory. (Bloom & Argamon, 2010) extends this idea by
proposing an approach for the automatic learning of these appraisal expressions. Table
2.7 summarizes the contributions discussed above.
Table 2.7.
Term-level polarity vs. phrase-level polarity approaches.
Approach                                  Contributions
Term-level-polarity based approaches      Hatzivassiloglou and McKeown (1993)
                                          Hatzivassiloglou and McKeown (1997)
                                          Bruce and Wiebe (2000)
                                          Hatzivassiloglou and Wiebe (2000)
Phrase-level-polarity based approaches    Turney (2002)
                                          Riloff and Wiebe (2003)
                                          Whitelaw et al. (2005)
                                          Bloom and Argamon (2010)
                                          Syed et al. (2010)
2.9. Negation Handling in sentiment analysis
Negation handling in sentiment analysis, as an independent task, is not yet
a well-solved problem, even for English text (Jia and Meng 2009; Wiegand et
al. 2010). This is because of the context-sensitive use of negation
particles. The first computational model for the treatment of negation is
presented in (Polanyi and Zaenen 2004). It models negation via contextual
valence shifting: the polarity of a subjective expression is reversed by
the use of a negation marker. The work in (Kennedy and Inkpen 2005) also
proposes an approach for contextual valence shifting. In addition to
dealing with simple negation particles, it defines a simple scope for
negation: if the negation particle immediately precedes a subjective
expression, then the polarity of that expression is flipped. As an
extension of this work, a parser is added for scope computation in
(Kennedy and Inkpen 2006).
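The simple scope rule described above (a negation particle immediately
preceding a subjective expression flips its polarity) can be sketched as
follows. The lexicon, the negation list, and the English example tokens are
assumptions for illustration and are not taken from the cited works.

```python
# Hypothetical resources: prior polarities and negation particles.
PRIOR = {"good": 1, "bad": -1}
NEGATIONS = {"not", "never"}


def shifted_polarities(tokens):
    """Return (term, polarity) pairs, flipping a term's prior polarity
    when the token immediately before it is a negation particle."""
    result = []
    for i, tok in enumerate(tokens):
        if tok in PRIOR:
            polarity = PRIOR[tok]
            if i > 0 and tokens[i - 1] in NEGATIONS:
                polarity = -polarity
            result.append((tok, polarity))
    return result


shifted = shifted_polarities("the film was not good".split())
```

Because the scope is exactly one token wide, a particle separated from the
subjective expression by any other word has no effect; a wider scope
requires further analysis, such as the parser-based extension mentioned
above.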
Table 2.8.
Negation handling for sentiment analysis.
Approach                        Contributions
Contextual valence shifting     Polanyi and Zaenen (2004)
Scope of negation               Jia and Meng (2009)
                                Kennedy and Inkpen (2005)
                                Kennedy and Inkpen (2006)
Supervised machine learning     Wilson et al. (2005)
Compositional semantics         Moilanen and Pulman (2008)
The work in (Wilson et al. 2005) uses a supervised machine learning
method. It selects features such as negation features, shifter features,
and polarity modification features for advanced negation modeling. A
technique to compute the polarity of complex noun phrases and headlines
using compositional semantics is presented in (Moilanen and Pulman
2008). The research in (Jia and Meng 2009) investigates the effect of
different scope models of negation. It achieves scope detection through
heuristic rules focused on polar expressions and on static and dynamic
delimiters.
All of the above contributions treat negation as an independent lexical unit that can
affect entire words, phrases, or sentences. However, there are many cases in which the
negation occurs within the word structure, e.g., “بے فایدہ” (bay fayeeda, useless). Only a
few works address this type of negation (Moilanen and Pulman 2008).
Chapter review:
This chapter describes the state-of-the-art research in sentiment analysis and Urdu
language processing. The literature survey is divided into the following sections: features
of the given text, techniques, sentiment-annotated lexicon construction, generalization
among domains, processing of morphologically rich languages, Urdu language
processing, adjective based SA techniques, term level vs. phrase level polarity, and
negation handling in SA.
CHAPTER 3
DISTINCTIVE FEATURES OF THE URDU LANGUAGE
Before reporting our research contributions, some background must be presented and
discussed. We first describe the Urdu language itself, which is the main subject of this
investigation. Urdu is introduced briefly to provide background for the discussion in later
chapters. Because the language is not widely studied, this chapter contains more detail
than would be necessary if a more widely known language, such as English or French,
were being studied.
Language family:          Indo-European
Influencing languages:    Firstly: Persian, Arabic, and Turkish
                          Secondly: Sanskrit, English
Script:                   Perso-Arabic
Writing style:            Nastalique
Major dialect:            Hindi
Regions:                  National language of Pakistan; widely spoken in
                          Afghanistan, Bahrain, Bangladesh, Botswana, Fiji,
                          Germany, Guyana, India, Malawi, Mauritius, Nepal,
                          Norway, Oman, Qatar, Saudi Arabia, South Africa,
                          Thailand, UAE, United Kingdom and Zambia
Table 3.1. Brief overview of Urdu language

Urdu is an Indo-European language. Persian, Arabic, Turkish, and English have had a
great influence on the Urdu vocabulary, whereas the grammar is more inclined towards
Sanskrit. Some major dialects of Urdu are Hindi, Dakhini, Pinjari, Rekhta, and Modern
Vernacular Urdu. Moreover, Urdu is the national language of Pakistan and is widely
spoken in India, Afghanistan, Bangladesh, Bahrain, Oman, Saudi Arabia, South Africa,
and the United Kingdom. Some salient features of the Urdu language are given in Table
3.1.
The distinctiveness of a language is recognized by its inherent characteristics: its
orthography, vocabulary, parts of speech, grammar, and morphology. We present here a
concise overview of these characteristics of the Urdu language:
3.1. Orthography
The orthography of a language specifies a standardized method for using a specific script
or writing system: a set of symbols (the alphabet, with its graphemes and diacritics) and
the rules for writing these symbols. It covers the relationships between graphemes and
phonemes that generate word spellings. It also covers diacritics, capitalization,
hyphenation, word boundaries, punctuation marks, and emphasis. The orthography of the
Urdu language is inclined toward Arabic and Persian influences.
Figure 3.1 Character set of Urdu.

3.1.1. Character set
The character set of Urdu is an extended version of the Arabic character set used for
Persian. It represents sounds that are not present in Arabic or Persian, including alveolar
consonants, aspirated stops, and long vowels. There are 58 letters in Urdu, as given in
Figure 3.1 (Hardie 2003).
The Arabic script employs letters to represent consonants and diacritics to indicate
vowels. Urdu has both long and short vowels. Diacritics are placed on the consonants to
specify the short vowels, whereas the long vowels are indicated by the combined effect of
a consonant with a diacritic and an additional letter. These diacritics are optional and
usually not written, but they exist implicitly, and a native speaker understands their
pronunciation. Figure 3.2 shows that a consonant can carry two diacritics and that these
can be written above or below the consonant.
بّ بْ بً بٰ بُ بَ بِ
Figure 3.2 Diacritics in Urdu with letter “ب”.
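Because the diacritics are optional, the same word can appear in a text with or without
them, and a processing pipeline may want to map both spellings to one form. As a minimal
sketch, assuming that removing every Unicode combining mark is acceptable for the task,
the standard unicodedata module can be used. The function name strip_diacritics is an
assumption, and the rule is deliberately coarse: it also removes combining marks that are
not vowel diacritics, such as a combining hamza.

```python
import unicodedata


def strip_diacritics(text):
    """Drop every combining mark, keeping only the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


# Ain + kasra + lam + meem: the word ilm written with one diacritic.
with_diacritic = "\u0639\u0650\u0644\u0645"
without_diacritic = strip_diacritics(with_diacritic)
```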
3.1.2. Word order
Generally, the basic word order of the Urdu clause is subject-object-verb (SOV).
Variation in this word order is common, particularly the reordering of nominal
constituents for thematic purposes. For this reason, (Butt, 1995) argues that Urdu is a
free-word-order, or non-configurational, language.
3.1.3. Bidirectional script

Like Arabic, the Urdu script is bidirectional: words are written from right to left,
whereas numbers are written from left to right.
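In Unicode text this behavior is driven by the bidirectional class assigned to each
character. As a small illustration, the standard unicodedata module reports these classes;
the sample characters below are chosen for illustration only.

```python
import unicodedata

# Bidirectional classes: "AL" is a right-to-left Arabic-script letter,
# "EN" is a European number, and "L" is a left-to-right letter.
samples = {
    "beh": "\u0628",            # Arabic-script letter used in Urdu
    "urdu digit 7": "\u06F7",   # Extended Arabic-Indic digit seven
    "latin digit 7": "7",
    "latin a": "a",
}
classes = {name: unicodedata.bidirectional(ch) for name, ch in samples.items()}
```

The letter is classed as right-to-left, while both kinds of digits are classed as numbers
that keep their left-to-right order inside right-to-left text.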
3.1.4. Ligatures
Urdu uses the Perso-Arabic script, which is cursive and context-sensitive with respect to
the shapes of the letters. This means that the “حروف” (haroof, letters) have multiple
glyphs and shapes and are categorized as joiners and non-joiners. The joiner letters join
together into units called ligatures (Durrani and Hussain 2010). A word can have a single
ligature or multiple ligatures. During writing, all characters join together until a
non-joiner appears. A new ligature starts after the non-joiner. The process is repeated
until the word ends. If a word contains more than one ligature, it appears to have a space
within it, but this space is not there. Consider the example of “جانور” (janwar, animal):
this word has three ligatures, which are written without a space, whereas the word “ہمت”
(himat, courage) has only one ligature. Ligatures can also be separated even in the
absence of a non-joiner, for example, in “کبھی کبھی” (kabhi kabhi, sometimes) and “بے جان”
(bay jaan, lifeless). This phenomenon is very common in the compounding and
reduplication of words.
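The joining process described above can be sketched as a small function that splits a
word into its ligatures: letters accumulate until a non-joiner appears, which closes the
current ligature. The set of non-joiners below is a simplified assumption rather than an
exhaustive inventory, and the function name split_ligatures is hypothetical.

```python
# Letters that connect only to the preceding letter, so they end a ligature.
# This set is a simplified assumption, not a complete list.
NON_JOINERS = set("اآدڈذرڑزژوے")


def split_ligatures(word):
    """Split one word into ligatures; a ligature ends after each non-joiner."""
    ligatures, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:
            ligatures.append(current)
            current = ""
    if current:
        # Joiners left over at the end of the word form the last ligature.
        ligatures.append(current)
    return ligatures


counts = {word: len(split_ligatures(word)) for word in ("جانور", "ہمت")}
```

The sketch covers only the regular case; the separation of ligatures without a non-joiner,
as in compounds and reduplicated words, depends on spacing conventions that this rule
does not model.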
An Urdu character exhibits multiple shapes according to its position in the ligature, i.e.,
in the initial, medial, or final position, or it remains unconnected. For example, consider
the letter “ج” (jeem). It can be joined in the initial position as “جا”, in the medial position
as “بجا”, and in the final position as “حج”; see Table 3.2.
Remark                            Letters        Shape adjustment
Joined in the initial position    ج + ا          جا
Joined in the medial position     ب + ج + ا      بجا
Joined at the final position      ح + ج          حج
In a word with a non-joiner       آ + ج          آج
Table 3.2. Different shapes of a single alphabet “ج” (jeem).
Because of this context-sensitive orthography and the different behaviors of joiners and
non-joiners, word boundary identification becomes a major task. A space is not always an
indicator of a word boundary.
3.2. Parts of Speech

According to the literature (Ijaz and Hussain, 2007; Muaz and Hussain, 2009), there are
eleven distinct parts of speech in the Urdu language:
1. Noun
2. Verb
3. Adjective
4. Adverb
5. Pronoun
6. Post Positions
7. Numerals
8. Auxiliaries
9. Conjunctions
10. Haroof
11. Case markers
Among these, nine (1-9 in the above list) are similar to the English parts of speech in
their semantics (although their morphological and grammatical rules are clearly distinct),
while the “حروف” (haroof) and the case markers are different. The “haroof” are words
that have no independent meaning; they become meaningful only when used with other
words (Schmidt, 2000). Examples are “اے” (ay), “او” (o), “واہ” (wah), and “نا” (na).
3.3. Vocabulary
The absorption power of Urdu is quite exceptional. In addition to the Arabic, Persian, and
Turkish influences, Urdu has kept absorbing vocabulary from English, Sanskrit, and
Hindi. This capacity enhances the richness of the language. Table 3.3 gives some
examples of Urdu words taken from English, Persian, Sanskrit, Arabic, and Turkish,
along with their use in sentences.
Language    Borrowed word and example Urdu sentence
English     ٹیلی فون (telephone, telephone)
            ٹیلی فون خراب ہے (telephone khrab hay, Telephone is out of order.)
Persian     فردوس (firdos, heaven)
            سوات فردوس نظیر ہے (sawat firdos nazeer hay, Sawat is like heaven.)
Sanskrit    آشا (aasha, wish)
            میری آشا پوری ہوگئ (meri aasha puri ho gayee, My wish came true.)
Turkish     خاتون (khatoon, lady)
            وہ ایک نفیس خاتون ہیں (woh aik nafees khatoon hain, She is a fine lady.)
Arabic      جنت (janat, heaven)
            گھر جنت ہے (ghar janat hay, Home is heaven.)
Table 3.3 Examples of Urdu words from multiple languages.
3.4. Morphology
Morphology can be defined as the study of the structure of words. For example, it
describes how “الفاظ” (alfaaz, words) is inflected from the word “لفظ” (lafz, word). The
definition of morphology leads to the concept of the morpheme, the smallest unit of
meaning or the smallest recurring unit. Morphemes are to morphology what words are to
syntax. Morphemes express concepts, like “بادل” (badal, cloud) and “پنکھا” (pankha, fan),
or relationships, like “مند” (mand) in “دولت مند” (dolat mand, rich) and “بے” (bay) in
“بے جان” (bay jaan, lifeless). Morphemes can also express syntactic features, for example,
number (singular, plural), e.g., “پودا” (poda, plant), “پودے” (poday, plants), and gender
(masculine, feminine), e.g., “گیا” (gya, went, inflected for masculine), “گئ” (gayee, went,
inflected for feminine).
The term morph refers to the realization of a morpheme as part of a word. For example,
in the word “پر” (pur, feather), the morpheme “پر” (pur) is realized as the morph “پر”
(pur) to form the word. In “پروں” (puron, feathers), the morpheme “پر” (pur) and the
PLURAL morpheme are realized as “پر” + “وں” (pur + oon), respectively, to form the
word.
The term allomorph refers to the different forms of a morpheme; e.g., the PLURAL
morpheme in Urdu has several allomorphs: the plural of “پودا” (poda, plant) is “پودے”
(poday, plants), whereas the plural of “پھول” (phool, flower) is “پھولوں” (phoolon,
flowers). Morphemes are further categorized as free morphemes, which can form words
by themselves, e.g., “بارش” (barish, rain) and “آسمان” (aasman, sky), and bound
morphemes, which must be combined with other morphemes to form words, e.g., “با”
(ba) in “باعزت” (baizat, respectable). Words can consist of free morphemes only, bound
morphemes only, or free and bound morphemes jointly.
As far as its morphology is concerned, Urdu belongs to the category of morphologically
rich languages (MRLs), like Arabic, Persian, Chinese, Turkish, Finnish, and Korean.
MRLs pose considerable challenges for natural language processing, machine translation,
and speech processing (Abdul-Mageed and Korayem, 2010). These languages are
distinctive because of highly productive and frequent morphological processes at the
word level, e.g., compounding, reduplication, inflection, agglutination, and derivation.
Because of these morphological operations, the same root word can generate multiple
word forms. This makes the stemming process quite challenging.
Also, the lexicons of MRLs tend to be more complex. Dependencies and relationships
between different parts of speech are frequent. This increases the level of intricacy and
results in inflection or derivation gaps, because various forms of the same underlying
base form can easily be misidentified as unrelated entries, with negative effects on the
overall alignment of words and hence on the processing accuracy.
Some frequent morphological processes for Urdu are discussed below:
3.4.1. Inflection and derivation
Inflectional operations deal with the variety of forms of the same word. The changes
indicate grammatical features, e.g., “جانا” (jana, to go) from “جا” (ja, go). The difficult
aspect of these inflections is their diversity. In English, for example, plurals are made by
adding s, es, or ies according to predefined grammatical rules. Exceptions exist but are
rare.
In contrast, in Urdu the Arabic loan words are made plural according to Arabic grammar,
the Persian loan words follow Persian grammar, and so on. For example, the plural of
“لفظ” (lafz, word) is “الفاظ” (alfaaz, words), and the plural of “پودا” (poda, plant) is “پودے”
(poday, plants). The two words are inflected differently to form the plural.
Derivational operations deal with the production of new words with different meanings.
The new words are produced by adding affixes. Often the produced words have a
changed part of speech, e.g., “خوش” (khush, happy) and “خوش بخت” (khushbakht, lucky).
3.4.2. Compounding
The compounding process results in new words made by combining two existing words M
and N. Some examples of compound words in Urdu are:
• MN formation: M and N are independent in meaning and syntax, and they are simply
written together to make a new word.
For example, M = “موم” (mom, wax) and N = “بتی” (bati, light) make the word MN =
“موم بتی” (mombati, candle).
• M-O-N formation: M and N are independent words but are related in meaning or
context. Their syntax remains the same, with an additional letter “و” (O), which means
“and”.
For example, M = “ملک” (mulk, country) and N = “ملت” (milat, nation) make the
compound word M-O-N = “ملک و ملت” (mulk-o-milat, country and nation).
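The M-O-N pattern is regular enough to be sketched directly. The helper names make_mon
and split_mon are hypothetical, and the sketch assumes that the three parts are separated
by spaces, as in the example above.

```python
LINKER = "و"  # the letter O, meaning "and"


def make_mon(m, n):
    """Build an M-O-N compound from two words."""
    return f"{m} {LINKER} {n}"


def split_mon(compound):
    """Return (M, N) if the string has the M-O-N shape, otherwise None."""
    parts = compound.split()
    if len(parts) == 3 and parts[1] == LINKER:
        return parts[0], parts[2]
    return None


compound = make_mon("ملک", "ملت")
```

An MN compound such as “موم بتی” has no linking letter, so split_mon returns None for it;
recognizing MN compounds would require a lexicon.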
3.4.3. Reduplication
Both full and partial reduplication of words are very common in Urdu. For example, the
full reduplication of the word “کبھی” (kabhi, sometime) results in “کبھی کبھی” (kabhi
kabhi, infrequently).
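As a minimal sketch, the two kinds of reduplication can be told apart for a pair of
adjacent tokens. The prefix test used for partial reduplication is a crude assumption
chosen for illustration, and the function name reduplication_type is hypothetical.

```python
def reduplication_type(first, second):
    """Classify a pair of adjacent tokens as full or partial reduplication.

    Returns "full", "partial", or None. The prefix test for the partial
    case is a rough heuristic, not a linguistic definition.
    """
    if first == second:
        return "full"
    if first.startswith(second) or second.startswith(first):
        return "partial"
    return None


kind = reduplication_type("کبھی", "کبھی")
```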
3.4.4. Compound verbs and verb phrases

In Urdu, root verbs and intensifying verbs combine to form compound verbs (Schmidt
1999). For example, the root verb “پکار” (pukar, call) and the intensifying verb “لو” (lo,
take) make the compound verb “پکار لو” (pukar lo, call (right away)). This compound verb
has the same meaning as the root verb but exhibits more strength. Table 3.4 gives further
examples of the morphological processes discussed, i.e., inflection, derivation,
compounding, reduplication, and compound verbs.
Operation                Word and modified form
Inflection               پھول (phool, flower)
                         پھولوں (phool-on, flowers)
Derivation               ممکن (mumkin, possible)
                         ناممکن (na-mumkin, impossible)
Compounding              جان (jaan, soul), دل (dil, heart)
                         دل و جان (dil-o-jaan, heart and soul)
Partial Reduplication    رات (raat, night)
                         راتوں رات (raat-on-raat, in a night)
Compound verbs           مار (maar, beat), ڈالو (dalo, put)
                         مار ڈالو (maar dalo, kill)
Table 3.4. Examples of morphological processes in Urdu.
3.5. Challenging features of the Urdu language
So far, we have given an overview of the Urdu language. Here, we concisely describe the
challenges posed by its distinctive features. These challenges relate to tasks in sentiment
analysis such as corpus collection, lexicon construction, and word boundary
identification:
3.5.1. Corpus construction

Urdu websites are becoming more popular day by day, but they still cannot readily be
used for corpus construction, because such a task needs a large amount of electronic text.
Unfortunately, most Urdu websites use graphic formats, i.e., GIF or other image formats,
to display Urdu text [12].
Despite these hurdles, there are some significant efforts. For example, a relatively small
corpus (20,000–50,000 words) of the Urdu language is available at
http://personal1.stthomas.edu/dmbecker/. In this corpus, the documents appear in a
minimally tagged format.
3.5.2. Complex stemming

As with other MRLs, the stemming of Urdu is complex, because various words emerge
from the same root. For example, the root word “علم” (ilm, knowledge) generates multiple
words with different meanings and forms, as given in Table 3.5.
3.5.3. Intricate lexicon
In most NLP applications, a lexicon is a main requirement. For sentiment analysis, the
lexicon becomes more complex, because every entry is annotated with sentiment in
addition to its grammatical and morphological information.
The Urdu language is a blend of languages spoken by the military troops, who invaded
the subcontinent in different eras and the local languages. Therefore, Urdu, previously
known as Rekhta (‫)رﯾﺨﺘہ‬, meaning molded or mixed, have strong linguistic influences
from Arabic, Persian, Turkish, Sanskrit and English, etc. For example, the words, “‫”ﺷﻤﺲ‬
(shams, sun), “‫( ”ﺑﮩﺘﺮ‬behter, better), “‫( ”ﭨﯿﻠﯽ وﯾﮋن‬televizun, television) and, “‫( ”ﭘﻮﺟﺎ‬pooja,
______________________________________________________________________________________
Framework.
worship) are Arabic, Persian, English, and Sanskrit loan words, respectively. Due to this
variability, the morphological operations use varying grammar rules. Most of the loan
words follow the grammar rules of their parent language. Generally, the Sanskrit based
adjectives show inflection to agree with the noun they qualify, this property is called
marking with respect to case, gender, or number. Like the demonstrative adjective “‫”ﺟﯿﺴﺎ‬
(jaisa, such as), becomes “‫( ”ﺟﯿﺴﯽ‬jaisee, such as) and “‫( ”ﺟﯿﺴﮯ‬jaisay, such as) for gender
and number, respectively. On the other hand, most of the Persian loan words like “‫”ﺗﺎزه‬
(tazah, fresh) remain unmarked, because, they follow Persian grammar.
Technically, these features result in far more intricate lexicons for natural language
processing applications. The out-of-vocabulary rate is much higher than for languages
with well-defined grammars. These features also lead to poor or unreliable language model
probability estimates, because many combinations of word forms are missing from, or
rare in, the language model training data.
Root word         “ﻋﻠﻢ” (ilm, knowledge)

Inflected words   “ﻋﺎﻟﻢ” (aalim, knowledgeable)
                  “ﻋﺎﻟﻤہ” (aalimah, female knowledgeable)
                  “ﻣﻌﻠﻢ” (moalim, educator)
                  “ﻣﻌﻠﻤہ” (moalimah, female educator)
                  “ﻣﻌﻠﻮم” (maaloom, known)
                  “ﻣﻌﻠﻮﻣﺎت” (maaloomaat, information)
Table 3.5 Inflection of multiple words from the root word “ﻋﻠﻢ” in the Urdu language.
3.5.4. Word boundary identification

Urdu employs the Perso-Arabic script in the Nastalique writing style. Its orthography is
context sensitive, and the “ﺣﺮوف” (haroof, letters) take different shapes according to
their positions in a word. For example, “ﮔﺎ”, “ﺟﮓ”, “ﻣﮕﻦ”, and “ﺟﺎگ” show four different
shapes of the same letter “گ” (gaaf). Urdu letters are categorized as joiners and
non-joiners (Lehal, 2010). The joining letters form ligatures. A word can have one
or more ligatures, depending on whether a non-joiner occurs within it. For example,
“‫( ”ﺟﮕﻨﻮ‬jugnu, firefly) is a single ligature word but the word “‫( ”ﺟﺎﮔﻮ‬jago, wakeup) has
two ligatures. If the final letter of a word is a joiner, it tends to join with the first
letter of the next word, so that the word boundary is misidentified. For
example, “ﮐﻞ رات” (kal raat, tomorrow night) consists of two different words written
with a space; if this space is omitted by mistake, the final joiner of the first word
joins with the first letter of the second word and the text becomes “ﮐﻠﺮات” (kalraat).
Hence, unlike in English text, spaces are not always true indicators of word
boundaries.
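The ligature and word boundary behavior described above can be sketched as follows. This is a minimal illustration, assuming a partial set of non-joiner letters; the set `NON_JOINERS` and the function `ligatures` are hypothetical names introduced here, and the words are given as escape sequences of standard Unicode code points.

```python
# Assumed, partial list of non-joiners: letters that do not connect to the
# following letter (alif, alif madda, dal, ddal, zal, re, rre, ze, zhe, wao,
# bari ye).
NON_JOINERS = set(
    "\u0627\u0622\u062f\u0688\u0630\u0631\u0691\u0632\u0698\u0648\u06d2"
)


def ligatures(word):
    """Split a word into ligatures: a ligature ends after each non-joiner."""
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


JUGNU = "\u062c\u06af\u0646\u0648"             # jugnu, firefly
JAGO = "\u062c\u0627\u06af\u0648"              # jago, wake up
KAL_RAAT = "\u06a9\u0644 \u0631\u0627\u062a"   # kal raat, written with a space
KALRAAT = KAL_RAAT.replace(" ", "")            # the same text, space omitted

print(len(ligatures(JUGNU)))   # 1
print(len(ligatures(JAGO)))    # 2
print(len(KAL_RAAT.split()))   # 2 words when the space is present
print(len(KALRAAT.split()))    # 1 token when the space is omitted
```

In the last case the first ligature of the merged string spans the original word boundary, which is why whitespace tokenization alone cannot be relied upon.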
3.5.5. Diacritics omission

As in Arabic, diacritics are present in Urdu as vowel marks. However, their use is not
standardized and depends on the author. Hence, they are removed as a preprocessing step
in NLP applications. This is an accepted practice in the Urdu language research
community (Durani and Hussain, 2010).
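A minimal sketch of this preprocessing step is given below. It assumes that the diacritics of interest are the Arabic combining marks in the Unicode range U+064B to U+0652 plus the superscript alif U+0670; the exact set used by a particular system may differ, and the function name is hypothetical.

```python
import re

# Assumed set of diacritics: fathatan..sukun (U+064B-U+0652) and superscript alif.
DIACRITICS = re.compile("[\u064b-\u0652\u0670]")


def strip_diacritics(text):
    """Remove vowel marks so that marked and unmarked spellings coincide."""
    return DIACRITICS.sub("", text)


# "ilm" written with a kasra and a sukun, and the same word without marks.
marked = "\u0639\u0650\u0644\u0652\u0645"
plain = "\u0639\u0644\u0645"
print(strip_diacritics(marked) == plain)  # True
```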
3.5.6. Code switching

Another interesting feature of Urdu is code switching. In linguistics, code switching
means using multiple languages concurrently. This phenomenon is very common in Urdu
writing. For example, “mobile off ﮐﺮ دو” (mobile off kar do) means
“switch off the mobile”. This property makes it difficult to determine the correct lexical
category or part of speech.
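As a first step towards handling code-switched text, each token can be labeled with its script. The sketch below is only an illustration under the assumption that Urdu tokens fall in the Unicode Arabic block (U+0600 to U+06FF) and English tokens are ASCII letters; the function name `script_of` and the label names are hypothetical.

```python
def script_of(token):
    """Label a token by script: 'arabic', 'latin', or 'other'."""
    if any("\u0600" <= ch <= "\u06ff" for ch in token):
        return "arabic"
    if any(ch.isascii() and ch.isalpha() for ch in token):
        return "latin"
    return "other"


# The example from the text: "mobile off kar do".
tokens = "mobile off \u06a9\u0631 \u062f\u0648".split()
scripts = [script_of(t) for t in tokens]
print(scripts)  # ['latin', 'latin', 'arabic', 'arabic']
```

Script labels do not resolve the part of speech by themselves, but they indicate which tokens need a lexicon other than the Urdu one.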
3.5.7. Independent case marking
Case markers are defined as the relational morphemes, lexical units, or words
that mark the grammatical function of the words with which they are used. In Urdu,
the case markers are syntactically attached to the words but are lexically independent,
which means that they receive independent POS tags (Rizvi et al., 2005). They affect the
structure of the sentence and can cause grammatical ambiguities; for example, the free word
order of Urdu text is due to case markers. Both of the phrases “رﻧﮕﻮں ﮐﮯ ﻧﺎم”
(rangoon kay naam, colors’ names) and “ﻧﺎم رﻧﮕﻮں ﮐﮯ” (naam rangoon kay, colors’
names) are correct and have the same meaning but different word order, due to the use of the
case marker “ﮐﮯ” (kay). Some more examples of the use of case markers are
“اﯾﺮان ﮐﺎ ﺑﺎدﺷﺎه” (Iran ka badshah, king of Persia) and “ﺷﯿﺸﮯ ﮐﯽ ﺑﻮﺗﻞ” (sheeshay ki bottle, glass
bottle).
Moreover, Urdu text contains two types of affixes: (a) morphemes and (b) words or
lexical units. Morphemes are lexically attached to nouns through morphological
operations. For example, to form the plural “ﭘﻮدے” (poday, plants) of the word “ﭘﻮدا” (poda,
plant), the plural suffix “ے” (ay) is applied, as shown in Table 3.6.
The words or lexical units, in contrast, are independent units. They are further categorized as
case markers, pure postpositions, and possession or genitive markers. The case markers
are further divided into core case markers and oblique case markers. Case markers assign a
grammatical function to the marked words and, in general, are morphologically attached
to the words at the lexical level. In Urdu, however, they are syntactically attached and
lexically independent.
1. Morphemes                            ﭘﻮدا → ﭘﻮدے (plural suffix ے (ay) is applied)

2. Words or lexical units               ﻧﺎ (na), ﻧﯽ (ni), ﻧﮯ (nay), ﺳﮯ (say)
2.1. Case marker                        ﺳﮯ (say), ﮐﻮ (ko)
2.1.1. Core case markers                ﻣﯿﮟ ﻧﮯ ﮐﮩﺎ (mein nay kaha, I said)
2.1.2. Oblique case marker              ﺑﺎﮨﺮ ﻧﮑﺎﻟﻨﺎ (bahir nikal-na, put out)
2.2. Pure postpositions                 ﮐﻤﺮے ﻣﯿﮟ ﺟﺎ (kamray mein ja, go to the room)
2.3. Possession or genitive markers     آپ ﮐﺎ ﻧﺎم (aap ka naam, your name)
Table 3.6. Examples of affixes, case markers and postpositions.
As an example of a core case marker, consider the sentence “ﻣﯿﮟ ﻧﮯ ﮐﮩﺎ” (mein nay kaha, I
said), in which the case marker “ﻧﮯ” (nay) is used. Similarly, in the phrase “آپ ﮐﺎ ﻧﺎم”
(aap ka naam, your name), the possession marker “ﮐﺎ” (ka) is used. Table 3.6 gives some
more examples.
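Because these markers are lexically independent, a tagger can treat them as separate tokens with their own tags. The following is a minimal dictionary-based sketch; the tag names `CM`, `GEN`, and `WORD`, the marker lists, and the function `tag` are assumptions made for illustration and do not reproduce any existing tagset.

```python
# Assumed marker inventories (a small subset, for illustration only).
CASE_MARKERS = {"\u0646\u06d2", "\u06a9\u0648", "\u0633\u06d2"}      # nay, ko, say
GENITIVE_MARKERS = {"\u06a9\u0627", "\u06a9\u06cc", "\u06a9\u06d2"}  # ka, kee, kay


def tag(sentence):
    """Assign an independent tag to every whitespace-separated token."""
    tagged = []
    for token in sentence.split():
        if token in CASE_MARKERS:
            tagged.append((token, "CM"))
        elif token in GENITIVE_MARKERS:
            tagged.append((token, "GEN"))
        else:
            tagged.append((token, "WORD"))
    return tagged


MEIN_NAY_KAHA = "\u0645\u06cc\u06ba \u0646\u06d2 \u06a9\u06c1\u0627"  # I said
AAP_KA_NAAM = "\u0622\u067e \u06a9\u0627 \u0646\u0627\u0645"          # your name

print([t for _, t in tag(MEIN_NAY_KAHA)])  # ['WORD', 'CM', 'WORD']
print([t for _, t in tag(AAP_KA_NAAM)])    # ['WORD', 'GEN', 'WORD']
```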
‫( ﻧﮩﯿﮟ ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﮐﮯ ﮔﻨﺒﺪ ﭘﺮ‬naheen tera nasheman kasr-e sultani kay gunbad par)
‫( ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﮐﮯ ﮔﻨﺒﺪ ﭘﺮ ﻧﮩﯿﮟ‬tera nasheman kasr-e sultani kay gunbad par naheen)
‫( ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮔﻨﺒﺪ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﭘﺮ ﻧﮩﯿﮟ‬tera nasheman gunbad-e kasr-e sultani par naheen)
Translation: Your home is not on the tower of the king’s palace.
Table 3.7. Free word order property of the Urdu text.
3.5.8. Free word order

As already mentioned in Section 3.1.2, Urdu generally follows subject-object-verb (SOV)
word order. However, variation in this word order is frequent and well accepted, which is why
some researchers suggest that Urdu should be considered a free word order language
(Butt, 1995).
An example sentence is given in Table 3.7; it is written in three different ways with the
same semantics.
The free word order of Urdu text is due to:
• The case markers, which can identify constituents in multiple ways (Rizvi and
Hussain 2005). These are lexically independent and are considered an independent
lexical category.
• Diacritics used as possession markers. As these are optional, proper identification
of the meaning becomes very difficult.
Chapter review:
This chapter described the linguistic characteristics of the Urdu language. From
this description we conclude that Urdu is unique in a number of aspects related to
its orthography, morphology, grammar, and vocabulary. Its distinctive linguistic features
make it a challenging domain for the sentiment analysis community. Hence, analyzing the
sentiment orientation of Urdu text requires updated or altogether different algorithms
and approaches.
CHAPTER 4
SENTIUNITS: THE APPRAISAL EXPRESSIONS
Not all the terms in an opinionated sentence are subjective. Indeed, the sentiment of a
sentence depends only on some specific words or phrases. Consider the examples “This
book is very good.” and “The movie is boring.” The expressions “very good” and “boring”,
each made of one or more words, carry the sentiment information of the whole sentence.
We label them SentiUnits and judge only these units as the representatives of the
whole sentence’s sentiment. They are in fact the appraisal expressions defined and
discussed in Chapter 2. SentiUnits can be defined as the core grammatical structures
expressing the opinion, i.e., the sentiment carrier expressions in a sentence (Syed et al.
2010). To understand the structure of SentiUnits, consider the following examples
from Urdu text in Table 4.1.
1   This is a fine book.                   Yeh aik umdah kitab hay.               ﯾہ اﯾﮏ ﻋﻤﺪه ﮐﺘﺎب ﮨﮯ
2   This is a fine and informative book.   Yeh umdah aur malumati kitab hay.      ﯾہ ﻋﻤﺪه اور ﻣﻌﻠﻮﻣﺎﺗﯽ ﮐﺘﺎب ﮨﮯ
3   This is the finest book.               Yeh sab se umdah kitab hay.            ﯾہ ﺳﺐ ﺳﮯ ﻋﻤﺪه ﮐﺘﺎب ﮨﮯ
4   This book is not very bad.             Yeh kitab itni buri naheen.            ﯾہ ﮐﺘﺎب اﺗﻨﯽ ﺑﺮی ﻧﮩﯿﮟ
Table 4.1. Examples of opinionated sentences from Urdu with different SentiUnits.
In Table 4.1, the sentiment-bearing expressions (the SentiUnits) are responsible for the
subjectivity orientation. All other words are neutral and have no effect on the classification.
A closer look at these examples shows that the SentiUnits are built around adjectives (as head words).
They can consist of a single adjective, as in sentence 1, or of multiple words, as in
sentences 2, 3, and 4. Sentences 1, 2, and 3 have adjectives with positive orientation,
whereas sentence 4 contains a negative word that becomes positive through the use of
negation. In this case, the negation acts as a polarity shifter. Moreover, the intensity
of the expressions is determined by the modifiers, which can be absolute, comparative, or
superlative, just as in English text. Sentence 3 is an example of the superlative
degree of appraisal.
Hence, these expressions can be distinguished by six attributes: the adjectives as head
words, their modifiers, their orientation (positive or negative), the intensity of
this orientation, a polarity mark assigned to each word to show the intensity value, and
finally the negation (Syed et al. 2010). We consider two types of SentiUnits:
Single Adjective Phrases consist of an adjective head and possible modifiers, e.g.,
“ﺑﮩﺖ ﺧﻮش” (bohat khush, very happy), “زﯾﺎده ﺑﮩﺎدر” (zyada bahadur, more brave).
Multiple Adjective Phrases comprise more than one adjective with a delimiter or a
conjunction in between, e.g., “ﺑﮩﺖ ﭼﺎﻻک اور طﺎﻗﺘﻮر” (bohat chalak aur taqatwar, very clever
and strong).
As mentioned above, a SentiUnit can be described by the following attributes:
1. Adjectives (as head words)
2. Modifiers
3. Orientation
4. Intensity
5. Polarity
6. Negation
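The six attributes listed above can be pictured as a simple record. The sketch below is an assumption about one possible representation, not the data structure used in this work: the field types, the default values, the numeric polarity marks, and the helper `kind` are all hypothetical, and the words are given in transliteration for readability.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SentiUnit:
    adjectives: List[str]                               # 1. head words
    modifiers: List[str] = field(default_factory=list)  # 2. modifiers
    orientation: str = "positive"                       # 3. "positive" or "negative"
    intensity: str = "absolute"                         # 4. absolute/comparative/superlative
    polarity: float = 0.0                               # 5. polarity mark (illustrative value)
    negation: bool = False                              # 6. negation present?

    def kind(self):
        """Single versus multiple adjective phrase, by number of adjective heads."""
        return "multiple" if len(self.adjectives) > 1 else "single"


# "bohat khush" (very happy): one adjective head with one modifier.
single = SentiUnit(adjectives=["khush"], modifiers=["bohat"], polarity=1.0)
# "bohat chalak aur taqatwar" (very clever and strong): two adjective heads.
multi = SentiUnit(adjectives=["chalak", "taqatwar"], modifiers=["bohat"], polarity=1.0)

print(single.kind(), multi.kind())  # single multiple
```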
These attributes are described briefly in the following sections.
4.1. Adjectives
An adjective is a fundamental part of speech (POS) that expresses an attribute of a noun
(a place, thing, or person). In sentence structure, adjectives generally appear in two
ways: either they are directly linked with the noun within the noun phrase, or they are
associated with the noun through some other part of speech, e.g., a verb. In both cases they
describe the characteristic features of the noun they qualify. This suggests that any
opinion, sentiment, or judgment about a noun can be determined by analyzing its
adjectives. Because of this characteristic, the first effort at automatic sentiment analysis
(SA) of English text employed adjectives as the main feature of the given text
(Hatzivassiloglou & McKeown, 1993). Adjectives have therefore remained the center of
attention in the sentiment analysis community (Turney, 2002), (Riloff et al., 2003), (Riloff &
Wiebe, 2003) and (Bloom & Argamon, 2010).
As with all parts of speech, the use, type, and structure of adjectives differ from
language to language. Urdu is morphologically rich, and hence its adjectives and adjectival
phrases tend to be more complex because of frequent inflections and derivations. In
addition to this morphological complexity, variability in vocabulary and grammar rules
is regular in Urdu text and is considered normal, because the language
is strongly influenced by many other languages, such as Persian, Arabic, Sanskrit, and
English. For example, the adjective “ﺗﺎزه” (tazah, fresh) remains unmarked because it is
a Persian loan word and follows Persian grammar, whereas most of the Sanskrit-based
adjectives inflect to agree with the noun they qualify. For example, the
demonstrative adjective “ﺟﯿﺴﺎ” (jaisa, such as) becomes “ﺟﯿﺴﯽ” (jaisee, such as) and
“ﺟﯿﺴﮯ” (jaisay, such as) for gender and number, respectively. Moreover, the use of
postpositions as independent lexemes involves more specific patterns and rules.
These aspects suggest that Urdu has distinct characteristics and features. In this section
we therefore present a comprehensive overview of the structures of adjectival
phrases in Urdu text with respect to the task of sentiment analysis. For Urdu
NLP research this is the very first such effort; it was presented in (Syed et al. 2011b). So far,
the syntactic and morphological aspects of the language have been studied in relation to verbs,
nouns, and other parts of speech, but we find no contribution that investigates Urdu adjectival
phrases separately.
The analysis covers almost all aspects of adjectives and adjectival phrases. We
describe their morphological structures, marked and unmarked, through the types of
agreement with the noun they qualify; this agreement most frequently concerns gender,
number, and case. We also discuss their structure when they are used with a sequence of nouns
and in reduplication. Moreover, we define and illustrate with
examples the different adjective classes, i.e., descriptive, predicative, attributive, possessive,
demonstrative, and reflexive possessive adjectives. For each class we describe the
morphological structure of the adjectives and their inflected forms. We take most
commonly used adjectives as examples and clearly describe their modifications.
Figure 4.1 Types of adjectives in Urdu.

In linguistics, understanding the parts of speech (POS) of a language requires
recognizing their morphological structures and the processes through which these structures
are formed. Another significant aspect is their different forms or classes.
In this section we therefore explore these two features of Urdu adjectival phrases:
first their morphological structures and then their classes.
4.1.1. Morphological structure of adjectives

The morphological structure of Urdu adjectives is complex: they exhibit frequent
inflections and derivations in agreement with the noun they qualify.
Conceptually, adjectives in Urdu can be divided into two types (Schmidt 1999). The first
type describes quantity and quality, e.g., “ﮐﻢ” (kam, less), “ﺑﺪﺗﺮﯾﻦ” (badtareen, worst),
“زﯾﺎده” (ziyada, more). The second type distinguishes one person from
another, e.g., “ﺣﺴﯿﻦ” (haseen, pretty), “ﻓﻄﯿﻦ” (fateen, intelligent).
Morphologically, adjectives are categorized as marked and unmarked (Schmidt 1999).
Marked adjectives are those that can be inflected for number and gender, e.g., (a) “اﭼﮭﺎ ﮐﺎم” (acha
kaam, good work), (b) “اﭼﮭﮯ ﮐﺎم” (achay kaam, good works), and (c) “اﭼﮭﯽ ﺑﺎت” (achi
baat, good matter). In (a), (b), and (c), “اﭼﮭﺎ” (acha, good) is inflected for masculine,
plural, and feminine, respectively. Unmarked adjectives are usually Persian loan words, e.g., “ﺗﺎزه”
(tazah, fresh), and adjectives derived from nouns, e.g., “دﻓﺘﺮی” (daftary, official)
derived from “دﻓﺘﺮ” (daftar, office). Attributive adjectives are very frequent, and they
precede the noun they qualify, e.g., the adjective “ﻣﺰﯾﺪار” (mazedaar, tasty) precedes the
noun “ﻣﺰه” (maza, taste). Arabic and Persian loan adjectives are used predicatively, e.g.,
“ﻣﻌﻠﻮم ﮨﻮﻧﺎ” (maloom hona, to be known).
These adjectives appear in the form of phrases. The postpositions “ﺳﮯ” (say), “ﺳﯽ” (si),
“ﺳﺎ” (sa) and “واﻻ” (wala), “واﻟﯽ” (wali), “واﻟﮯ” (walay) are very frequently used with
nouns to form adjectives, e.g., “ﭘﮭﻮل ﺳﯽ” (phool si, like a flower) from “ﭘﮭﻮل” (phool,
flower). This discussion is summarized in Figure 4.1.
The unmarked adjectives do not inflect according to the nouns they qualify.
In other words, they do not take suffixes to show agreement with nouns. Most of
the Persian loan adjectives remain unmarked. Table 4.2 shows the unmarked
adjectives “دﻟﭽﺴﭗ” (dilchasp, interesting) and “ﺑﮩﺘﺮ” (behtur, better) with the
nouns (a) masculine-singular “ﮐﺎم” (kaam, task), (b) feminine-singular “ﮐﮩﺎﻧﯽ” (khani,
story), and (c) feminine-plural “ﮐﮩﺎﻧﯿﺎں” (khanian, stories).
With masculine-singular noun   With feminine-singular noun   With feminine-plural noun

“دﻟﭽﺴﭗ ﮐﺎم” (dilchasp kaam, interesting task)   “دﻟﭽﺴﭗ ﮐﮩﺎﻧﯽ” (dilchasp khani, interesting story)   “دﻟﭽﺴﭗ ﮐﮩﺎﻧﯿﺎں” (dilchasp khanian, interesting stories)
“ﺑﮩﺘﺮ ﮐﺎم” (behtur kaam, better task)   “ﺑﮩﺘﺮ ﮐﮩﺎﻧﯽ” (behtur khani, better story)   “ﺑﮩﺘﺮ ﮐﮩﺎﻧﯿﺎں” (behtur khanian, better stories)
Table 4.2 Examples of unmarked adjectives
a) Adjective marking: agreement in gender and number: Adjectives are marked
through suffixes for gender, masculine (m) and feminine (f), and for number, singular
(s) and plural (p). For example, the masculine adjective “اﭼﮭﺎ” (acha, good) is inflected
for gender as “اﭼﮭﯽ” (achi, good) and for number as “اﭼﮭﮯ” (achay, good). These suffixes
are attached to agree with the noun or nouns that the adjective qualifies. There are
three suffixes, i.e., singular-masculine (a), singular-feminine (ee), and
plural-masculine (ay). The single feminine suffix (ee) is used for both singular and plural.
Some examples of marked adjectives are given in Table 4.3. In this table we
consider three nouns: (a) masculine-singular “ﺑﭽہ” (bacha, kid), (b) feminine-singular
“ﮐﺎر” (car, car), and (c) masculine-plural “دن” (din, days). These nouns cause inflection in the
respective adjectives “اﭼﮭﺎ” (acha, good), “ﻟﻤﺒﺎ” (lamba, long), and “ﺑﺮا” (bura, bad).
Adjective (m, s) Inflected for gender (f) Inflected for number (m, p)
“اﭼﮭﺎ ﺑﭽہ” (acha bacha, good kid) “اﭼﮭﯽ ﮐﺎر” (achi car, good car) “اﭼﮭﮯ دن” (achay din, good days)
“‫( ”ﻟﻤﺒﺎ ﺑﭽہ‬lamba bacha, tall kid) “‫( ”ﻟﻤﺒﯽ ﮐﺎر‬lambee car, long car) “‫( ”ﻟﻤﺒﮯ دن‬lambay din, long days)
“‫( ”ﺑﺮا ﺑﭽہ‬bura bacha, bad kid) “‫( ”ﺑﺮی ﮐﺎر‬buree car, bad car) “‫( ”ﺑﺮے دن‬buray din, bad days)
Table 4.3 Adjective marking with gender and number.
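The suffix pattern of Table 4.3 can be sketched as a small function. This is a simplified illustration that assumes every adjective ending in alif is marked and every other adjective is unmarked; in practice this decision needs a lexicon, since not every adjective ending in alif inflects. The function name `inflect` and its argument values are hypothetical.

```python
ALIF = "\u0627"      # masculine-singular ending (a)
YE = "\u06cc"        # feminine ending (ee)
BARI_YE = "\u06d2"   # masculine-plural ending (ay)


def inflect(adjective, gender, number):
    """Inflect a marked adjective for gender ('m'/'f') and number ('s'/'p')."""
    if not adjective.endswith(ALIF):
        return adjective              # treated as unmarked: no change
    stem = adjective[:-1]
    if gender == "f":
        return stem + YE              # one feminine suffix for singular and plural
    if number == "p":
        return stem + BARI_YE
    return adjective


ACHA = "\u0627\u0686\u06be\u0627"              # acha, good
DILCHASP = "\u062f\u0644\u0686\u0633\u067e"    # dilchasp, interesting (unmarked)

print(inflect(ACHA, "f", "s") == "\u0627\u0686\u06be\u06cc")  # achi  -> True
print(inflect(ACHA, "m", "p") == "\u0627\u0686\u06be\u06d2")  # achay -> True
print(inflect(DILCHASP, "f", "p") == DILCHASP)                # unchanged -> True
```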
b) Agreement in case: Urdu nouns have three cases: nominative, oblique, and vocative.
An adjective that qualifies an oblique noun also becomes oblique.
The masculine-singular suffixes (a) and (an) are replaced by (ay) and (ayn), respectively.
Feminine adjectives remain unchanged, as shown in Table 4.4.
Masculine Feminine
Nominative “‫( ”ﭼﮭﻮﭨﺎ‬chota, little) “‫( ”ﭼﮭﻮﭨﯽ‬chotee, little)
“‫( ”ﺳﺎﺗﻮاں‬satwan, seventh) “‫( ”ﺳﺎﺗﻮﯾﮟ‬satween, seventh)
Oblique “‫( ”ﭼﮭﻮﭨﮯ‬chotay, little) “‫( ”ﭼﮭﻮﭨﯽ‬chotee, little)
“‫( ”ﺳﺎﺗﻮﯾﮟ‬satwayn, seventh) “‫( ”ﺳﺎﺗﻮﯾﮟ‬satween, seventh)
Vocative “‫( ”ﭼﮭﻮﭨﮯ‬chotay, little) “‫( ”ﭼﮭﻮﭨﯽ‬chotee, little)
“‫( ”ﺳﺎﺗﻮﯾﮟ‬satwayn, seventh) “‫( ”ﺳﺎﺗﻮﯾﮟ‬satween, seventh)
Table 4.4 Marking of adjectives for cases.
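The case rule of Table 4.4 can be sketched in the same way. The function below is a minimal illustration that assumes the input is a nominative singular form; the name `to_oblique` is hypothetical.

```python
ALIF = "\u0627"
NOON_GHUNNA = "\u06ba"
YE = "\u06cc"
BARI_YE = "\u06d2"


def to_oblique(adjective, gender):
    """Return the oblique form of a nominative singular adjective."""
    if gender == "f":
        return adjective                            # feminine forms do not change
    if adjective.endswith(ALIF + NOON_GHUNNA):      # (an) -> (ayn)
        return adjective[:-2] + YE + NOON_GHUNNA
    if adjective.endswith(ALIF):                    # (a) -> (ay)
        return adjective[:-1] + BARI_YE
    return adjective


CHOTA = "\u0686\u06be\u0648\u0679\u0627"           # chota, little
CHOTEE = "\u0686\u06be\u0648\u0679\u06cc"          # chotee, little (feminine)
SATWAN = "\u0633\u0627\u062a\u0648\u0627\u06ba"    # satwan, seventh

print(to_oblique(CHOTA, "m") == "\u0686\u06be\u0648\u0679\u06d2")         # chotay  -> True
print(to_oblique(SATWAN, "m") == "\u0633\u0627\u062a\u0648\u06cc\u06ba")  # satwayn -> True
print(to_oblique(CHOTEE, "f") == CHOTEE)                                  # unchanged -> True
```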
c) Adjectives with noun sequences: Sometimes an adjective appears in a sentence with a
sequence of two or more nouns. In this case the nouns may differ in
gender and number.
The adjective agrees with the noun nearest to it. Examples are given in Table
4.5, in which “ﺑﮍا” (bara, big) inflects for “ﭘﻠﻨﮓ” (palang, bed) and “ﭼﮭﻮﭨﯽ” (choti,
younger) inflects for “ﺧﺎﻟہ” (khala, aunt).
Adjective With the sequence of nouns

“‫( ”ﺑﮍا‬bara, big) “‫”ﺑﮍا ﭘﻠﻨﮓ اور اﻟﻤﺎرﯾﺎں‬
(bara palang aur almarian, big bed and cupboards)
“‫( ”ﭼﮭﻮﭨﯽ‬choti, younger) “‫ ﻣﺎﻣﻮں اورﺑﭽﮯ‬،‫”ﭼﮭﻮﭨﯽ ﺧﺎﻟہ‬
(choti khala, mamoon aur bachay, younger aunt, uncle and kids)
Table 4.5 Adjective agrees with the nearest noun in a sequence.
d) Reduplication of Adjectives: Urdu adjectives can be reduplicated either fully or partially.
In full reduplication the whole word is repeated as it is, whereas in partial reduplication
part of the word is repeated with a different spelling. Examples of full and
partial reduplication are given in Table 4.6.
Partial Reduplication Full Reduplication

“‫”ڈھﯿﻼ ڈھﺎﻻ ﻟﺒﺎس‬ “‫”ﺑﮍے ﺑﮍے ﮐﺎم‬
(dheela dhala libas, loose dress) (baray baray kaam, great tasks)
“‫”ﭼﮭﻮﭨﯽ ﻣﻮﭨﯽ ﺑﺎت‬ “‫”ﭼﮭﻮﭨﯽ ﭼﮭﻮﭨﯽ ﺑﺎﺗﯿﮟ‬
(choti moti baat, minute matter) (choti choti batain, minute matters)
Table 4.6 Adjective with partial and full reduplication
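Full reduplication is straightforward to detect, because the same token simply occurs twice in a row. The sketch below assumes whitespace-tokenized input and, for readability, uses the transliterations from Table 4.6; the function name is hypothetical. Partial reduplication is deliberately not handled, since it would need some measure of similarity between neighboring words.

```python
def full_reduplications(tokens):
    """Return the start positions of immediately repeated tokens."""
    return [i for i in range(len(tokens) - 1) if tokens[i] == tokens[i + 1]]


print(full_reduplications("baray baray kaam".split()))    # [0]
print(full_reduplications("choti choti batain".split()))  # [0]
print(full_reduplications("dheela dhala libas".split()))  # [] (partial, not detected)
```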
4.1.2. Classes of adjective

Urdu adjectives can be categorized as descriptive, predicative, attributive, possessive,
demonstrative, and reflexive possessive adjectives, as explained in the following paragraphs:
Descriptive Adjectives: These are the most frequent and important type of adjectives.
They describe attributes of the noun they qualify in terms of its size, dimensions, sound,
color, shade, shape, quality, personal trait, or time, etc.
Some examples of descriptive adjectives in Urdu are given in Table 4.7, where, “‫”ﭼﮭﻮﭨﺎ‬
(chota, little) and “‫( ”ﻟﻤﺒﺎ‬lamba, long) describe the size of a noun, and “‫( ”ﭘﯿﻼ‬peela,
yellow) and “‫( ”ﺳﺮخ‬surkh, red) express the color.
Category Examples
Size “‫( ”ﭼﮭﻮﭨﺎ‬chota, little), “‫( ”ﻟﻤﺒﺎ‬lamba, long)
Color “‫( ”ﭘﯿﻼ‬peela, yellow), “‫( ”ﺳﺮخ‬surkh, red)
Shape “‫( ”ﻣﺮﺑﻊ‬muraba, square), “‫( ”ﺗﮑﻮﻧﺎ‬tikona, triangular)
Personal trait “‫( ”اداس‬udaas, sad), “‫( ”ﻣﺠﺒﻮر‬majboor, helpless)
Qualities “‫( ”ﻣﮩﺮﺑﺎن‬mehrbaan, kind), “‫( ”اﭼﮭﺎ‬acha, good )
Table 4.7 Descriptive adjectives in Urdu
Attributive Adjectives: If descriptive adjectives directly precede a nominal head as
modifiers, they are called attributive adjectives, because they attributively modify or
restrict the meaning of the noun. For example, the adjective “ﭘﯿﻼ” (peela, yellow) modifies
the noun “ﻏﺒﺎره” (ghubara, balloon) to make it “ﭘﯿﻼ ﻏﺒﺎره” (peela ghubara, yellow
balloon). In this way the attributive adjective becomes part of the noun phrase. Some
more examples are given in Table 4.8.
Nouns Modified attributively

“‫(”ﻏﺒﺎره‬ghubara, balloon) “‫( ”ﭘﯿﻼ ﻏﺒﺎره‬peela ghubara, yellow balloon)
“‫( ”ﭼﮍﯾﺎ‬chiria, sparrow) “‫( ”اداس ﭼﮍﯾﺎ‬udaas chiria, sad sparrow)
“‫( ”ﺑﺎدﺷﺎه‬badshah, king) “‫( ”ﻣﮩﺮﺑﺎن ﺑﺎدﺷﺎه‬mehrbaan badshah, kind king)
Table 4.8 Attributive adjectives directly modify the nouns
Predicative Adjectives: When the adjectives are used predicatively, they bring in new
information about the noun instead of modifying it.
Nouns With predicative adjectives

“‫(”ﻏﺒﺎره‬ghubara, balloon) “‫( ”ﻏﺒﺎره ﭘﯿﻼ ﮨﮯ‬ghubara peela hay, the balloon is yellow)
“‫( ”ﭼﮍﯾﺎ‬chiria, sparrow) “‫( ”ﭼﮍﯾﺎ اداس ﺗﮭﯽ‬chiria udaas thee, the sparrow was sad)
“ﺑﺎدﺷﺎه” (badshah, king) “وه ﺑﺎدﺷﺎه ﻣﮩﺮﺑﺎن ﺗﮭﺎ” (woh badshah mehrbaan tha, that king was kind)
Table 4.9 Predicative adjectives describe the features of the nouns

These are not components of the noun phrase but complements of a copulative
function, which links them to the noun. Take the first example from Table 4.9, “ﻏﺒﺎره ﭘﯿﻼ ﮨﮯ”
(ghubara peela hay, the balloon is yellow). In this case, the adjective “ﭘﯿﻼ” (peela,
yellow) identifies the color of the noun “ﻏﺒﺎره” (ghubara, balloon). Only a specific feature
of the noun is described, and both parts of speech, i.e., the adjective and the noun, retain
their individual roles. Some more examples are given in Table 4.9.
Possessive Adjective: Possessive adjectives are used to indicate possession. This
possession relation is realized in two ways: either the adjectives precede the head noun as
modifiers in noun phrases, like the attributive adjectives, or they are preceded by a
suitable form of the genitive postposition “ﮐﺎ” (ka, of), “ﮐﯽ” (kee, of), or “ﮐﮯ” (kay, of).
These genitive postpositions are lexically independent, like “of” in English, but they agree
in number and gender with the object noun. Consider the first example from Table 4.10,
“ارﺗﻀﯽ ﮐﺎ ﭘﯿﻼ ﻏﺒﺎره” (Irtaza ka peela ghubara, Irtaza’s yellow balloon). In this example the
genitive postposition “ﮐﺎ” (ka, of) is used with a singular masculine noun, i.e., “ﭘﯿﻼ ﻏﺒﺎره”
(peela ghubara, yellow balloon). In the second example, “ﻣﯿﺮی” (meri, my) is a
possessive adjective that is used for the first person and in this case is inflected for
gender. The third example also contains the genitive postposition “ﮐﺎ” (ka, of) with a singular
masculine noun.
Examples
“ارﺗﻀﯽ ﮐﺎ ﭘﯿﻼ ﻏﺒﺎره” (Irtaza ka peela ghubara, Irtaza’s yellow balloon)
“‫( ”ﻣﯿﺮی اداس ﭼﮍﯾﺎ‬meri udaas chiria, my sad sparrow)
“‫( ” اﯾﺮان ﮐﺎ ﻣﮩﺮﺑﺎن ﺑﺎدﺷﺎه‬Iran ka mehrbaan badshah, kind king of Persia)
Table 4.10 Examples of possessive adjectives
Demonstrative Adjective: The demonstrative pronouns act as adjectives to indicate or
demonstrate the specific inherent features of a noun or nouns of a particular type.
Adjectives Examples
“‫( ”اﯾﺴﺎ‬aisa, like this) “‫( ”اﯾﺴﺎ ﻟﺒﺎس‬aisa libas, the dress like this)
“وﯾﺴﺎ” (waisa, like that) “وﯾﺴﺎ ﻟﺒﺎس” (waisa libas, the dress like that)
“‫( ”ﺟﯿﺴﺎ‬jaisa, such as) “‫( ”ﺟﯿﺴﺎ ﻟﺒﺎس‬jaisa libas, such dress)
“‫( ”ﮐﯿﺴﺎ‬kaisa, how) “‫( ”ﮐﯿﺴﺎ ﻟﺒﺎس؟‬kaisa libas, what kind of dress)
Table 4.11 Examples of demonstrative adjectives
As shown in Table 4.11, the Urdu demonstrative pronouns are different for near “‫”اﯾﺴﺎ‬
(aisa, like this), far “‫( ”وﯾﺴﺎ‬waisa, like that), relative “‫( ”ﺟﯿﺴﺎ‬jaisa, such as) and
interrogative “‫( ”ﮐﯿﺴﺎ‬kaisa, how) demonstrations. These demonstrative adjectives inflect
to agree with the noun for gender and number. These inflections are shown in Table 4.12.
Adjectives Inflected for gender Inflected for number

“‫( ”اﯾﺴﺎ‬aisa, like this) “‫( ”اﯾﺴﯽ‬aisee, like this) “‫( ”اﯾﺴﮯ‬aisay, like this)
“‫( ”وﯾﺴﺎ‬waisa, like that) “‫( ”وﯾﺴﯽ‬waisee, like that) “‫( ”وﯾﺴﮯ‬waisay, like that)
“‫( ”ﺟﯿﺴﺎ‬jaisa, such as) “‫( ”ﺟﯿﺴﯽ‬jaisee, such as) “‫( ”ﺟﯿﺴﮯ‬jaisay, such as)
“‫( ”ﮐﯿﺴﺎ‬kaisa, how) “‫( ”ﮐﯿﺴﯽ‬kaisee, how) “‫( ”ﮐﯿﺴﮯ‬kaisay, how)
Table 4.12 Inflection of demonstrative adjectives
Reflexive possessive adjective: The reflexive possessive adjectives are very frequently
used and agree with the noun they qualify, i.e., they inflect for gender, number, and
case. For example, “اﭘﻨﺎ” (apna, own), “اﺳﮑﺎ” (uska, someone else’s), and “اﺳﮑﺎ” (iska,
someone else’s) indicate one’s own, someone else’s (far), and someone else’s
(near), respectively. Examples of the reflexive possessive adjective “اﭘﻨﺎ” (apna, own) are given in
Table 4.13; it is inflected for gender as “اﭘﻨﯽ ﭼﺎﺑﯽ” (apni chabee, one’s own key) and for
number as “اﭘﻨﮯ ﻟﻮگ” (apnay loag, one’s own people).
Nouns With reflexive possessive adjectives

“ﮔﮭﺮ” (ghar, house) “اﭘﻨﺎ ﮔﮭﺮ” (apna ghar, one’s own house)
“ﭼﺎﺑﯽ” (chabee, key) “اﭘﻨﯽ ﭼﺎﺑﯽ” (apni chabee, one’s own key)
“ﻟﻮگ” (loag, people) “اﭘﻨﮯ ﻟﻮگ” (apnay loag, one’s own people)
Table 4.13 Examples of reflexive possessive adjectives
4.2. Modifiers
The modifiers intensify the orientation of an adjective. They can be absolute,
comparative, or superlative. Modifiers formed with postpositions are very frequent in
Urdu writing. For example, the absolute adjective “ﻣﮩﻨﮕﺎ” (mehnga, expensive) is
modified by the postposition “ﺳﮯ” (say) to make it comparative: “اس ﺳﮯ ﻣﮩﻨﮕﺎ” (is say mehnga,
more expensive). Similarly, the postposition “ﺳﺐ ﺳﮯ” (sab say) results in a superlative expression:
“ﺳﺐ ﺳﮯ ﻣﮩﻨﮕﺎ” (sab say mehnga, most expensive). Some Persian loan words are also
commonly used in inflected forms. For example, “ﮐﻢ” (kam, less) is absolute and is
inflected to make the comparative “ﮐﻤﺘﺮ” (kamtar, lesser) and the superlative “ﮐﻤﺘﺮﯾﻦ”
(kamtareen, least) expressions. Detailed examples of modifiers are given in Table 4.14.
Modifier           Made by postpositions                                      Persian loan words

Absolute           ﻣﮩﻨﮕﺎ (mehnga, expensive)                                   ﮐﻢ (kam, less)
Comparative
(a) ﺳﮯ             اس ﺳﮯ ﻣﮩﻨﮕﺎ (is say mehnga, more expensive)                 ﮐﻢ + ﺗﺮ → ﮐﻤﺘﺮ
(b) ﺳﮯ زﯾﺎده       اس ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ (is say ziyadah mehnga)                   kam + tar → kamtar (lesser)
Superlative
(a) ﺳﺐ ﺳﮯ          ﺳﺐ ﺳﮯ ﻣﮩﻨﮕﺎ (sab say mehnga, most expensive)                ﮐﻢ + ﺗﺮﯾﻦ → ﮐﻤﺘﺮﯾﻦ
(b) ﺳﺐ ﻣﯿﮟ         ﺳﺐ ﻣﯿﮟ ﻣﮩﻨﮕﺎ (sab main mehnga)                              kam + tareen → kamtareen (least)
(c) ﺳﺐ ﺳﮯ زﯾﺎده    ﺳﺐ ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ (sab say ziyadah mehnga)
Table 4.14 Adjective modifiers.
These examples are further elaborated below for the noun “‫( ”ﻟﺒﺎس‬libaas, dress):
a). Absolute
“-‫”ﯾہ ﻟﺒﺎس ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas mehnga hay.
This dress is expensive.
b). Comparative
Either “say” or “say zyadah” can be used to compare
two objects.
“-‫”ﯾہ ﻟﺒﺎس اس ﺳﮯ ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas us say mehnga hay
This dress is more expensive than that.
or
“-‫”ﯾہ ﻟﺒﺎس اس ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas us say zyadah mehnga hay
This dress is more expensive than that.
c). Superlative
For superlatives “sab say” or “sab main”, or “sab say zyadah” are used.
“-‫”ﯾہ ﻟﺒﺎس ﺳﺐ ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas sab say zyadah mehnga hay
This dress is the most expensive.

or
“-‫”ﯾہ ﻟﺒﺎس ﺳﺐ ﺳﮯ ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas sab say mehnga hay
or
“-‫”ﯾہ ﻟﺒﺎس ﺳﺐ ﻣﯿﮟ ﻣﮩﻨﮕﺎ ﮨﮯ‬
Yeh libaas sab main mehnga hay
Also, as in Persian grammar, the suffixes “een” or “tareen” are used for the same purpose,
e.g., “ﺑﮩﺘﺮﯾﻦ” (behtareen, best), “ﺑﺪﺗﺮﯾﻦ” (badtareen, worst), “ﮐﻤﺘﺮﯾﻦ” (kamtareen, lowest).
“-‫”ﯾہ ﻟﺒﺎس ﺑﮩﺘﺮﯾﻦ ﮨﮯ‬
Yeh libaas behtareen hay
This dress is the best.
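The modifier patterns of this section can be summarized in a small rule-based sketch. It assumes whitespace-tokenized, transliterated input with the spellings used in the examples above ("say", "sab", "main", "zyadah") and a known position of the adjective; the function name `degree` is hypothetical, and the suffix test is a heuristic that would misfire on unrelated words ending in "tar".

```python
def degree(tokens, adj_index):
    """Classify the adjective at adj_index as absolute, comparative, or superlative."""
    adjective = tokens[adj_index]
    before = tokens[:adj_index]
    # Superlative: Persian suffix, or "sab say", "sab main", "sab say zyadah".
    if adjective.endswith("tareen"):
        return "superlative"
    if before[-2:] in (["sab", "say"], ["sab", "main"]):
        return "superlative"
    if before[-3:] == ["sab", "say", "zyadah"]:
        return "superlative"
    # Comparative: Persian suffix, or "say", "say zyadah".
    if adjective.endswith("tar"):
        return "comparative"
    if before[-1:] == ["say"] or before[-2:] == ["say", "zyadah"]:
        return "comparative"
    return "absolute"


print(degree("yeh libaas mehnga hay".split(), 2))                 # absolute
print(degree("yeh libaas us say mehnga hay".split(), 4))          # comparative
print(degree("yeh libaas sab say zyadah mehnga hay".split(), 5))  # superlative
print(degree("yeh libaas behtareen hay".split(), 2))              # superlative
```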
4.3. Orientation
Orientation describes the positivity or negativity of an expression, e.g., “اﭼﮭﺎ” (acha, good)
has a positive orientation.
4.4. Intensity
This is the intensity of the orientation, e.g., “ﺑﮩﺘﺮ” (behtar, better) is less intense than “ﺑﮩﺘﺮﯾﻦ” (behtareen, best).
4.5. Polarity
A polarity mark is attached to each word in the lexicon to show the orientation.
4.6. Negation
Negation is one of the most frequent linguistic structures that change the polarity of a
word, phrase, or sentence. Negation is not limited to negation markers or particles such as
not, never, or no; various other constructs also serve to negate the inherent
sentiments of a comment. Moreover, the presence of negation influences the
contextual polarity of the words, but this does not mean that all of the words conveying
sentiment are inverted.
Different forms of negation are discussed in the literature. Here, we give the three main
forms. Negation can be morphological, i.e., attached as a prefix or suffix to form a single
lexical unit, e.g., the prefix “ﺑﮯ” (bay) in “ﺑﮯ ﭘﺮواه” (bayparwah, careless) is used to
negate the word “ﭘﺮواه” (parwah, care). It can also be implicit, as in
“ﯾہ ﮔﮭﮍی ﺗﻤﮩﺎرے ﻣﻌﯿﺎر ﺳﮯ ﮐﻢ ﮨﮯ” (yeh gharee tumharay mayaar say kam hay, this watch is below your
standard/level). Even without a negation particle, this comment conveys a
negative opinion. To our knowledge, no research work is available that handles this type of
negation, because it cannot be handled automatically.
Lastly, negation can be explicit, through the use of negation particles, e.g.,
“ﯾہ ﮔﮭﮍی ﺗﻤﮩﺎرے ﻣﻌﯿﺎر ﮐﮯ ﻣﻄﺎﺑﻖ ﻧﮩﯿﮟ” (yeh gharee tumharay mayaar kay mutabiq naheen, this watch is
not according to your standard/level). In this comment, the negative effect is conveyed by
the negation particle “ﻧﮩﯿﮟ” (naheen, not), which can be detected automatically. Most
of the efforts towards automatic treatment of negation in sentiment analysis focus
on this last type, in which the negation appears explicitly.
4.6.1. Negation in Urdu language

In our work, we focus on the negation which appears explicitly in the given text through
negation particles. In Urdu, both sentential and constituent negation exist. Some
prominent negation particles are “‫( ”ﻣﺖ‬mat, don’t), “‫( ”ﻧﺎ‬na, no) “‫( ”ﻧﮩﯿﮟ‬naheen, not), “‫”ﺑﻨﺎ‬
(bina, without), and “‫( ”ﺑﻐﯿﺮ‬baghair, without).
Sentential Negation:
The negative particles “‫( ”ﻧﮩﯿﮟ‬naheen, not), “‫( ”ﻣﺖ‬mat, don’t) and “‫( ”ﻧﺎ‬na, no) are used to
express sentential negation. The particle “‫( ”ﻧﮩﯿﮟ‬naheen, not) appears before the main
verb, which may or may not be followed by an auxiliary verb. In imperative
constructions, the particles “‫( ”ﻣﺖ‬mat, don’t) and “‫( ”ﻧﺎ‬na, no) are used in the preverbal
position. Table 4.15 gives the use of these negation particles before the main verbs; “‫”ﺟﺎﺗﺎ‬
(jata, goes) and “‫( ”ﭘﮍھﻮ‬parho, read).
Examples
“وه ﺳﮑﻮل ﻧﮩﯿﮟ ﺟﺎﺗﺎ ﮨﮯ” (woh school naheen jata hay, He doesn’t go to the school.)
“ﮐﺘﺎب ﻣﺖ ﭘﮍھﻮ” (kitaab mat parho, Don’t read the book.)
“ﮐﺘﺎب ﻧﺎ ﭘﮍھﻮ” (kitaab na parho, Don’t read the book.)
Table 4.15 Examples of sentential negation from Urdu text.
Constituent Negation:
The constituent negation is used to negate some particular constituent/constituents of a
sentence. Usually the negative particle comes after the negated constituent. Some
common constituent negation particles are; “‫( ”ﻧﮩﯿﮟ‬naheen, not), “‫( ”ﻣﺖ‬mat, don’t), “‫”ﻧﺎ‬
(na, no), “‫( ”ﻋﻼوه‬ilaawa, except), “‫( ”ﺳﻮا‬siva, except) and “‫( ”ﺑﻨﺎ‬bina, without). In Table
4.16, the negation particles, “‫( ”ﻧﮩﯿﮟ‬naheen, not), “‫( ”ﻣﺖ‬mat, don’t), “‫( ”ﻧﺎ‬na, no) and “‫”ﺳﻮا‬
(siva, except) are used after the negated constituent.
Examples
“ﮐﯿﻤﺮه ﮐﺎﻻ ﻧﮩﯿﮟ ﻧﯿﻼ ﮨﮯ” (camera kala naheen neela hay, the camera is blue, not black)
“اﻧﮕﻮر ﻣﺖ/ﻧﺎ ﺧﺮﯾﺪو اﻧﺎر ﺧﺮﯾﺪو” (angoor mat/na khareedo anar khareedo, don’t buy grapes, buy
pomegranates)
“ﻣﻮﺑﺎﯾﻞ ﮐﮯ رﻧﮓ ﮐﮯ ﺳﻮا ﺳﺐ اﭼﮭﺎ ﮨﮯ” (mobile kay rang kay siwa sab acha hay, everything is fine
with the mobile except its color)
Table 4.16 Examples of constituent negation from Urdu text.
Use of multiple negation particles:

Sometimes two negation particles are used together for emphasis. For example, in the
sentence “وه ﺳﮑﻮل ﻧﮩﯿﮟ ﻧﺎ ﮔﯿﺎ” (woh school naheen na gya, he did not go to school), the two
negation particles “ﻧﮩﯿﮟ” (naheen, not) and “ﻧﺎ” (na, no) are used to give stress or
emphasis.
Negation in coordinate structures:
In coordinate structures, the negation particle does not move to the coordinate point
unless the identical element is deleted from the second negative conjunct. In constructions
like ‘neither … nor’, however, it appears in the initial position. For example,
“ﻧﺎ ﮔﮭﺮ ﻧﯿﺎ ﮨﮯ ﻧﺎ ﮨﻮادار” (na ghar nya hay, na hawa daar, the house is neither new nor
ventilated).
Hence, in Urdu negation particles exist at both levels, i.e., sentential and constituent, like
in English, but their use in the sentence structure is different.
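As a minimal sketch of how the explicit negation particles listed above could be spotted in a tokenized sentence, the following Python function scans a token list for them. The romanized token forms, the set name NEGATION_PARTICLES, and the function name find_negations are assumptions made for illustration; they are not part of the framework itself.

```python
# Romanized forms of the negation particles named in this section (assumed spelling).
NEGATION_PARTICLES = {"naheen", "mat", "na", "bina", "baghair", "ilaawa", "siwa"}

def find_negations(tokens):
    """Return (index, particle) pairs for every explicit negation particle."""
    return [(i, tok) for i, tok in enumerate(tokens) if tok in NEGATION_PARTICLES]

sentence = "woh school naheen jata hay".split()
print(find_negations(sentence))  # [(2, 'naheen')]
```

The returned positions could then be used to decide whether a particle precedes the main verb (sentential negation) or follows a constituent (constituent negation).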
4.7. SentiUnit extraction model

As already discussed, our approach to automatic sentiment classification is based on
subjective phrases, or appraisal expressions, so we emphasize the accurate
identification of the SentiUnits.
Figure 4.2 SentiUnit extraction and polarity computation.
The model is grammatically motivated and works at the level of the grammatical structure
of the sentences. It uses a sentiment-annotated lexicon based approach to identify such
expressions in corpora of Urdu text based reviews (see Figure 4.2). The adjectives, their
modifiers, and polarity shifters like the explicit negation particles, e.g., “ﻧﮩﯿﮟ”
(naheen, not), “ﻣﺖ” (mat, don’t), “ﻧﺎ” (na, no), etc., are handled within these expressions.
For a given Urdu language based review, the SentiUnit extraction and polarity
computation takes place in three phases.
a. Firstly, the normalized text is passed to the parts-of-speech (POS) tagger, which
assigns POS tags to all the terms. Along with this tagging the word polarities are also
annotated to the subjective words. This polarity annotation takes place with the help
of the sentiment annotated lexicon of the Urdu text.
b. These annotated subjective terms (adjectives) are considered as the headwords for the
next phase in which shallow parsing is applied for phrase chunking and the adjectival
phrases are chunked out. Now, these chunks are converted into SentiUnits by
attaching the negation, modifiers, conjunctions, etc.
c. In the last phase, the identified SentiUnits are analyzed for polarity computation. The
polarity of the subjective terms is combined with the effect of the negation, if
it exists in the SentiUnit. Hence, the overall sentiment or impact of the SentiUnit is a
combination of its constituents.
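The third phase can be illustrated with a small sketch. The lexicon entries, their numeric polarity values, and the rule that a negation particle simply flips the sign of the SentiUnit are assumptions chosen for illustration; the text above does not fix these values.

```python
# Toy sentiment-annotated lexicon: romanized word -> prior polarity (assumed values).
LEXICON = {"acha": 1.0, "shandar": 2.0, "derpa": 1.0}
NEGATIONS = {"naheen", "mat", "na"}

def sentiunit_polarity(unit_tokens):
    """Sum the prior polarities in a SentiUnit and flip the sign if it is negated."""
    polarity = sum(LEXICON.get(tok, 0.0) for tok in unit_tokens)
    negated = any(tok in NEGATIONS for tok in unit_tokens)
    return -polarity if negated else polarity

print(sentiunit_polarity(["derpa", "naheen"]))  # -1.0
print(sentiunit_polarity(["shandar"]))          # 2.0
```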
4.8. The appraisal targets

So far we have discussed the adjectives and adjectival phrases as the SentiUnits, which
express an attribute of a noun (place, thing, or person). The noun is the fundamental part
of speech (POS) about which the opinion is expressed. If the link between the noun and the
adjective is not correctly identified, then there is a great possibility of misclassification
or of error about the exact meaning of the opinion. We call these nouns or noun phrases the
targets of the appraisal. Commonly, adjectives associate with nouns in the sentence
structure in two ways:
• The adjectives are directly linked with the noun within the noun phrase.
• They associate with the noun through some other part of speech, e.g., a verb.
In both cases they describe the characteristic features of the noun they qualify. The
following section describes the characteristics and structure of the noun phrases in the
Urdu language.
4.8.1. Cases of noun phrases

The core case markers change the case of the NP into four different types, i.e.,
nominative, ergative, dative and accusative, summarized in Figure 4.3.
Cases of NP with core case markers:
a. Nominative: There is no case marker with the NP; the noun is in the nominative case.
b. Ergative: NP marked with the case marker “ﻧﮯ” (ne) in an actor role.
c. Dative: NP marked with “ﮐو” (ko) in an indirect object or receiver role.
d. Accusative: NP marked with “ﮐو” (ko) in a direct object role.
Figure 4.3 Cases of noun phrases with core case markers.
4.8.2. Possession markers in noun phrases

In Urdu the genitive markers or postpositions are used as the possession markers (PM).
There are three possession markers in Urdu, as shown in Table 4.17.
In the literature, the possession markers are considered different from the case markers
due to the following features:
• In a noun phrase, the possession marker comes between two nominals. For example,
in “ﻓﻠﻢ ﮐﺎ ﻧﺎم” (film ka naam, name of the movie), the possession marker “ﮐﺎ” (ka)
comes between the nominals “ﻓﻠﻢ” and “ﻧﺎم”.
• A possession marker indicates that, in a noun phrase, the first nominal is the possessor
or holder of the second nominal.
• The second nominal in the noun phrase determines the form of the possession marker.
That is, the first nominal is in the oblique form and the possession marker agrees with
the second nominal in number and gender. For example, in the noun phrase “ﻓﻠﻢ ﮐﺎ ﻧﺎم”
(film ka naam, name of the movie), the possession marker “ﮐﺎ” (ka) agrees with the
second noun “ﻧﺎم”, which is singular masculine.
As the possession markers are not restricted by a verbal predicate, they do not directly
mark a grammatical function.
PM | Gender, number | Example
ﮐﺎ (ka) | Masculine, singular | ﻓﻠﻢ ﮐﺎ ﻧﺎم (film ka naam, name of the film)
ﮐﯽ (kee) | Feminine, singular | ﻓﻠﻢ ﮐﯽ ﮐﮩﺎﻧﯽ (film kee kahani, story of the film)
ﮐﯽ (kee) | Feminine, plural | ﻓﻠﻢ ﮐﯽ ﮐﮩﺎﻧﯿﺎں (film kee kahanian, stories of the film)
ﮐﮯ (kay) | Masculine, plural | ﻓﻠﻢ ﮐﮯ ﮐﺮدار (film kay kirdar, characters of the film)
Table 4.17 Possession markers in Urdu.
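The agreement pattern of Table 4.17 can be written as a small lookup. This is only a sketch: the romanized marker forms follow the table, while the function names and the string labels for gender and number are hypothetical.

```python
# Agreement table following Table 4.17 (romanized marker forms).
POSSESSION_MARKERS = {
    ("masculine", "singular"): "ka",
    ("feminine", "singular"): "kee",
    ("feminine", "plural"): "kee",
    ("masculine", "plural"): "kay",
}

def possession_marker(gender, number):
    """Pick the marker that agrees with the second (possessed) nominal."""
    return POSSESSION_MARKERS[(gender, number)]

def genitive_phrase(possessor, possessed, gender, number):
    """Build 'possessor PM possessed' from the gender and number of the possessed noun."""
    return f"{possessor} {possession_marker(gender, number)} {possessed}"

print(genitive_phrase("film", "naam", "masculine", "singular"))   # film ka naam
print(genitive_phrase("film", "kahanian", "feminine", "plural"))  # film kee kahanian
```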
4.8.3. Effect of complex noun phrases in Urdu text

The EXTRACTOR module of the system (explained in Chapter 5) identifies the targets
through shallow parsing based chunking. These targets are the non-overlapping noun
phrases “اﺳﻤﯽ ﺗﺮﮐﯿﺐ” (ismi tarkeeb) present in the text. A noun phrase is a unit of one or
more linked words with a noun as the head word and all other words as its dependents. Urdu
noun phrases vary in structure and complexity. A noun phrase can even include other phrases
as its components, e.g., adjectival and genitive phrases. In addition to this internal
complexity, the position of a noun phrase in the sentence is not fixed, owing to the free
word order of Urdu (Rizvi and Hussain 2005). Hence, a chunker for Urdu noun phrases must be
capable of handling both aspects simultaneously.
Example:
The following sentence contains a complex noun phrase.
“-‫”ارﺗﻀﯽ ﮐﺎ ﮐﮭﻠﻮﻧﺎ روﺑﻮٹ ﺷﺎﻧﺪار ﮨﮯ‬

Irtaza ka khilona robot shandar hay
The toy robot of Irtaza is wonderful
Description:
In this sentence a complex noun phrase is used which is based on three nouns “‫”ارﺗﻀﯽ‬
(Irtaza, proper noun), “‫( ”ﮐﮭﻠﻮﻧﺎ‬khilona, toy) and “‫( ”روﺑﻮٹ‬robot, robot) with a possession
marker “‫( ”ﮐﺎ‬ka, of).
‫ = ارﺗﻀﯽ ﮐﺎ ﮐﮭﻠﻮﻧﺎ روﺑﻮٹ‬NP
Irtaza ka khilona robot
The SentiUnit in the sentence is single adjective based with positive orientation, i.e.,
“‫( ”ﺷﺎﻧﺪار‬shandar, wonderful).
Chapter review:
SentiUnits, the sentiment carrier expressions, are described in detail in Chapter 4. A
general model for identifying a subjective sentence or opinion with identifiable appraisal
expressions is based on three units: the source of appraisal, the appraisal expression, and
the target of appraisal. This model is defined in detail in the next chapter, where the
source of appraisal in a given review is the reviewer and the target is the entity about
which the appraisal is made. Our approach to sentiment analysis of the Urdu language is
grammatically motivated and incorporates a sentiment-annotated lexicon for the
identification of the sentiment carrier expressions, or appraisal expressions, in a
sentence.
CHAPTER 5
IMPLEMENTATION: CLASSIFICATION MODEL AND LEXICON STRUCTURE
In this chapter, we present our sentiment classification model, which handles a
morphologically rich language, Urdu. Our model is grammatically motivated and
employs a sentiment-annotated lexicon based classification approach for the
identification of the sentiment carrier expressions in a sentence, called the SentiUnits.
The sentence subjectivity is based on these expressions, and all the other terms are
considered neutral. Therefore, the subjective polarity of a sentence is computed from the
polarities of its constituent SentiUnits. We logically partition a single opinionated
sentence into three units:
1. Source of appraisal
2. SentiUnit (the appraisal expression)
3. Target of this appraisal.
First, we extract the SentiUnits and the targets, and then these targets are associated with
the respective SentiUnits.
We break up the task of sentiment analysis into four modules, as shown in Figure 5.1.
The PREPROCESSOR module identifies the word boundaries and segments the sentence
into meaningful words or lexical units. The output of the PREPROCESSOR goes to
the EXTRACTOR module as an input. The EXTRACTOR extracts the sentiment
expressions and the noun phrases, as the SentiUnits and the Targets, respectively. Then,
the ASSOCIATOR module is responsible for linking the candidate targets to each
extracted SentiUnit.
Finally, the CLASSIFIER identifies polarities of the SentiUnits in each sentence and
calculates the overall sentiment of the review as a sum of sentence polarities.
Figure 5.1 System model representing modules and their interactions (Syed et al 2012).
5.1. PREPROCESSOR
In general, for natural language processing applications, the preprocessing phase deals
with the removal of punctuation marks and other unnecessary symbols and the
stripping of HTML tags. In addition to these tasks, our PREPROCESSOR module has to
handle diacritics and word boundary identification, issues which are specific to the Urdu
language.
5.1.1. Diacritic omission

Similar to the other Arabic script based languages (Persian, Turkish, Sindhi, and Punjabi),
Urdu script consists of two classes of symbols: letters and diacritics. Just like the letters,
the diacritics are useful for the readability and understanding of the script. They not only
represent the vowels but also affect the meanings of the words. In writing, however, these
symbols are optional, and it is observed that some authors use some diacritics regularly
and totally ignore the others. Even the overuse of a particular kind is very common.
Hence, their use is highly author dependent. This underuse, overuse, and sometimes
absence of diacritics adds to the morphological as well as lexical ambiguity of the
language. For example, POS tagging of diacritic-bearing words can generate incorrect
results due to ambiguous meaning. This issue is considered an unresolved critical problem
in linguistics research and hence, as a regular practice, the diacritics are removed as a
part of the preprocessing phase (Durrani and Hussain 2010).
5.1.2. Word boundary identification

In almost all natural language processing applications, word segmentation or word
boundary identification through tokenization is the foremost obligatory task.
Tokenization is easy to implement for languages in which word boundaries are identified
through punctuation marks or white spaces, e.g., Spanish, English, and French (Lehal
2010). In such languages, the input sentence is considered as a sequence of letters, which
determine a sequence of the words, i.e., < w1, w2, w3 ... wi > → < l1, l2, l3 ... lj >. Each
sentence is segmented into the lexical words with the help of word boundaries. But, this
process becomes complicated, if white spaces or other word delimiters are rarely or never
used as word boundaries.
As we already mentioned in Chapter 4, Urdu orthography is context sensitive. The
“‫( ”ﺣﺮوف‬haroof, alphabets) are divided in two categories as joiners and non joiners. The
joiners take multiple glyphs and shapes according to the context, which cause word
boundary identification issues. The work in (Durrani and Hussain 2010) divides the
word segmentation of Urdu text into two sub-problems, i.e., space insertion and space
omission.
i) Space-insertion
Many words in Urdu are made by more than one ligature (usually two). Semantically and
syntactically these ligatures are part of a single word. If the last letter of the first ligature
in a word is a joiner then it tends to join with the first letter of the second ligature. To
avoid this joining, a space is inserted by the writer.
This causes space insertion errors, e.g., “ﺧﻮش ﺑﺎش” (khush bash, happy) is a single word
with two ligatures, L1 = “ﺧﻮش” and L2 = “ﺑﺎش”. The last letter of L1, “ش”, is a joiner, which
tends to join with the first letter of L2, “ب”. To avoid this joining, a space is inserted
while typing the word. On omitting this space we get “ﺧﻮﺷﺒﺎش”, which is not a correct word,
so the writer cannot avoid the space.
ii) Space-omission
There are many words which end with non-joiner letters. As the non-joiner letters keep a
constant shape so usually the writers do not insert spaces while writing the next word to
identify word boundary. This does not affect the readability of the words but for
computational tasks the boundary identification becomes an issue as both words are
written in continuation without space. For example, the phrase, “‫( ”ﺷﯿﺮاورﺑﮑﺮی‬shair aur
bakri, lion and goat) is written without, and “‫( ”ﺷﯿﺮ اور ﺑﮑﺮی‬shair aur bakri, lion and
goat) is with spaces. We rewrite the phrase with the symbol “|” to indicate the word
boundaries “‫( ”ﺷﯿﺮ| اور| ﺑﮑﺮی‬shair aur bakri, lion and goat).
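To make the space-omission problem concrete, the sketch below recovers word boundaries with a greedy longest-match lookup against a known vocabulary. This is a generic dictionary-based technique used here purely for illustration; it is not the segmentation model of (Durrani and Hussain 2010) or of this work, and the romanized vocabulary and the function name segment are assumptions.

```python
def segment(text, vocabulary):
    """Greedy longest-match segmentation of a string written without spaces."""
    words, start = [], 0
    while start < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), start, -1):
            if text[start:end] in vocabulary or end == start + 1:
                words.append(text[start:end])
                start = end
                break
    return words

vocab = {"shair", "aur", "bakri"}
print("|".join(segment("shairaurbakri", vocab)))  # shair|aur|bakri
```

A greedy match can choose a wrong boundary when one vocabulary word is a prefix of another, which is one reason a dedicated segmentation model is needed in practice.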
For the Urdu language, word segmentation is handled by most researchers as part of a
larger task, e.g., morphological analysis, POS tagging, or translation. A few
contributions have dealt with this issue as an independent task, for example, (Durrani and
Hussain 2010), (Lehal 2010) and (Lehal 2009). In particular, (Durrani and Hussain 2010)
present a detailed literature survey to identify the inherent causes and then
propose a word segmentation model.
Based on the above discussion and on previous work, we propose to perform the
PREPROCESSOR task in four steps, as shown in Figure 5.2. First, normalization is
performed on the given text to remove symbols and tags. Second, diacritic omission is
performed to avoid ambiguity. Third, the sentence is tokenized as a sequence of
orthographic words OW = ow1, ow2, …, own, where the words ow1, ow2, ... are not
necessarily grammatical or meaningful words; they are only orthographically separated
from each other.
This sequence becomes the input to the final step, segmentation. The result of
segmentation is a sequence of meaningful and grammatically correct words ready for
further processing.
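The first three steps can be sketched as follows. The regular expressions are assumptions made for illustration: the diacritic range covers the common Arabic-script combining marks (U+064B to U+0652, plus U+0670), the punctuation class is a small sample, and the function name preprocess is hypothetical. The final segmentation step is not shown.

```python
import re

TAG_RE = re.compile(r"<[^>]+>")                      # HTML/XML tags
DIACRITIC_RE = re.compile(r"[\u064B-\u0652\u0670]")  # Arabic-script combining marks
PUNCT_RE = re.compile(r'[\u06D4\u060C\u061F.,!?;:"()\-]')  # sample Urdu and Latin punctuation

def preprocess(text):
    """Normalize, drop diacritics, and split into orthographic words."""
    text = TAG_RE.sub(" ", text)       # step 1: remove tags
    text = PUNCT_RE.sub(" ", text)     # step 1: remove punctuation symbols
    text = DIACRITIC_RE.sub("", text)  # step 2: diacritic omission
    return text.split()                # step 3: orthographic words

print(preprocess("<p>kitab, achi hay.</p>"))  # ['kitab', 'achi', 'hay']
```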
Figure 5.2 Preprocessing of the input sentence by the PREPROCESSOR module (Syed et al 2012).
5.2. EXTRACTOR
The EXTRACTOR module identifies and extracts the SentiUnits and the targets. Two
subtasks are performed:
• Extracting SentiUnits with adjectives as head words
• Extracting targets with nouns as head words
The EXTRACTOR module uses shallow parsing based text chunking. This method identifies
the beginnings and ends of grammatical phrases without parsing the full phrase structure.
Hence, the EXTRACTOR shallow parses each sentence in the given review to find
adjective or noun phrases and then works out their attributes (modifiers, orientation,
intensity, etc.), modeling the behavior of the modifiers and the negations within the
phrase.
Figure 5.3 Processing of the input sentence by EXTRACTOR module (Syed et al 2012).
For extracting SentiUnits, the parser starts with a lexicon of nominal and adjectival head
words, which defines initial values for the orientation, whether positive or negative. In
addition to a positive or negative orientation, head words carry an intensity of
orientation. The parser searches for occurrences of these head words in the sentence and,
upon finding one, moves rightward to attach modifiers, because the modifiers appear on the
right side of the adjectives in Urdu. It then searches for polarity shifters or negations
and finally delimits the whole subjective expression. Likewise, the parser identifies
candidate targets with the help of the lexicon: it finds all target groups that match words
specified in the lexicon. These steps are given in Figure 5.3.
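A minimal sketch of this head-word driven chunking is given below. It works on romanized tokens in reading order, so a modifier written to the right of the adjective in Urdu script is the token that precedes it in the list. The word lists and the function name extract_sentiunits are assumptions for illustration, and only one modifier and one following negation particle are attached.

```python
ADJECTIVES = {"achi", "derpa", "shandar"}  # adjectival head words (assumed lexicon)
MODIFIERS = {"bohat"}
NEGATIONS = {"naheen", "mat", "na"}

def extract_sentiunits(tokens):
    """Chunk each adjective head word with an adjacent modifier and negation."""
    units = []
    for i, tok in enumerate(tokens):
        if tok not in ADJECTIVES:
            continue
        start, end = i, i + 1
        if i > 0 and tokens[i - 1] in MODIFIERS:
            start = i - 1  # modifier precedes the head in reading order
        if end < len(tokens) and tokens[end] in NEGATIONS:
            end += 1       # negation particle follows the head
        units.append(tokens[start:end])
    return units

print(extract_sentiunits("bohat achi kitab".split()))           # [['bohat', 'achi']]
print(extract_sentiunits("iski battery derpa naheen".split()))  # [['derpa', 'naheen']]
```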
5.3. ASSOCIATOR
Figure 5.4 The dependency parsing of the given sentence (Syed et al 2012).
The extracted SentiUnits and targets are associated with each other through
ASSOCIATOR. We apply dependency parsing for this purpose. Figure 5.4 shows the
dependency parsing of the sentence;
“‫”ﻟﮍﮐﺎ ﮐﻤﭙﯿﻮﭨﺮ اور اﻟﯿﮑﭩﺮوﻧﮑﺲ ﮐﯽ ﭼﯿﺰﯾﮟ ﺑﯿﭽﺘﺎ ﮨﮯ‬
larka computer aur electronics kee cheezain baichta hay.
The boy sells computer and electronic products.
5.3.1. Working of the ASSOCIATOR
First the nominal group that is the lexical representation of the target is identified and
then the values of the attributes describing that target are computed. ASSOCIATOR finds
the target phrase by following the paths through a dependency parse of the sentence. The
result of the dependency parse is a ranked list of paths or linkage specifications. These
specifications are ranked to specify the order in which the links should be traversed. For
each SentiUnit, the system looks for the paths through the dependency tree which
annotate any word in the SentiUnit to the next or final expected word according to the
specification of that particular link. With the identification of a word in the proper
syntactic place, the shallow parsing is applied moving rightward to find a noun phrase
that ends in the identified word. These steps are shown in Figure 5.5.
Figure 5.5 Linking SentiUnits with candidate targets by ASSOCIATOR module (Syed et al 2012).
5.3.2. Algorithm
Hence, the steps performed by ASSOCIATOR are:
Input: Shallow parsed sentence with extracted SentiUnits and targets.
Processing: Apply dependency parse and then,
1. Search all the linkages such that;
a. The linkage is in the linkage specifications
b. The linkage connects to a chunked SentiUnit
c. The linkage need not connect to chunked target

2. For each chunked SentiUnit;
a. If there exists any linkage to the chunked target then,
b. Remove unconnected linkages
3. Select the linkage according to priority of linkage specifications.
Output: One linkage per SentiUnit.
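The selection logic of these steps can be sketched as below. The Linkage record, the specification names, and the function name associate are hypothetical; a real system would obtain the linkages from a dependency parser, which is not modeled here.

```python
from collections import namedtuple

# A linkage found in the dependency parse; `target` is None when the path
# does not end in a chunked target. All names here are hypothetical.
Linkage = namedtuple("Linkage", "spec sentiunit target")

def associate(linkages, ranked_specs, sentiunits):
    """Return one linkage per SentiUnit, following steps 1 to 3 above."""
    rank = {spec: i for i, spec in enumerate(ranked_specs)}
    result = {}
    for unit in sentiunits:
        # Step 1: linkages in the specifications that connect to this SentiUnit.
        candidates = [ln for ln in linkages if ln.spec in rank and ln.sentiunit == unit]
        # Step 2: if any candidate reaches a chunked target, drop the rest.
        connected = [ln for ln in candidates if ln.target is not None]
        candidates = connected or candidates
        # Step 3: pick the candidate with the highest-priority specification.
        if candidates:
            result[unit] = min(candidates, key=lambda ln: rank[ln.spec])
    return result

links = [Linkage("verb-mediated", "bohat achi", None),
         Linkage("within-np", "bohat achi", "kitab")]
best = associate(links, ["verb-mediated", "within-np"], ["bohat achi"])
print(best["bohat achi"].target)  # kitab
```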
5.4. CLASSIFIER
The CLASSIFIER starts from calculating the intensity of orientation of the SentiUnits by
comparing each tagged word with the polarity values assigned in the lexicon entries. For
example, the expression “‫( ”ﺑﮩﺖ اﭼﮭﯽ ﮐﺘﺎب‬bohat achi kitab, very good book) is more
intense than “‫( ”اﭼﮭﯽ ﮐﺘﺎب‬achi kitab, good book) due to the modifier “‫”ﺑﮩﺖ‬, (bohat, very)
and both are positive expressions. In this expression, the SentiUnit “‫( ”ﺑﮩﺖ اﭼﮭﯽ‬bohat
achi, very good) is associated with the target “‫( ”ﮐﺘﺎب‬kitab, book).
The CLASSIFIER looks for other associations identified by the ASSOCIATOR and then
calculates the polarity value of each association for a particular target, e.g., “ﮐﺘﺎب”
(kitab, book) in this case. If “ﺑﮩﺖ اﭼﮭﯽ” (bohat achi, very good) is the only expression in
the sentence showing sentiments about the target, then the sentence polarity is equal to
the polarity of this expression; otherwise, the other expressions are also evaluated. The
polarity is calculated as the sum over the expressions, with positive expressions
contributing positive values and negative expressions contributing negative values.
5.4.1. Working of the CLASSIFIER

According to the problem statement, the given review R may consist of a single sentence
or of multiple sentences, among which some are subjective sentences, in the set
$S_s = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\}$, and others are objective,
$S_o = \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\}$, such that
$R = \{S_{s1}, S_{s2}, S_{s3}, \ldots, S_{sk}\} \cup \{S_{o1}, S_{o2}, S_{o3}, \ldots, S_{ol}\}$,
where $k = 1, 2, 3, \ldots, n$; $l = 1, 2, 3, \ldots, m$; n and m are finite numbers.
The final polarity of the review, PR, is calculated as the sum of all sentence polarities
computed by the CLASSIFIER module. If $P_{s_i}$ represents the polarity of the i-th
sentence, then
$P_R = \sum_{i=1}^{N} P_{s_i}$,
where N is a finite number.
5.4.2. Algorithm:
Hence, the CLASSIFIER module is divided into two steps as given next,
Step1: Compute sentence polarity
Input: Dependency parsed sentence with SentiUnits to targets associations.
Processing: Start with any one SentiUnit of a particular target
a. COMPARE each word in the SentiUnit with the lexicon to find its orientation and
polarity value;
b. COMPUTE SentiUnit polarity by adding polarities of the words according to the
intensity values
c. LOOK FOR another SentiUnit for the same target
d. Sentence polarity = SUMMATION of all SentiUnits’ polarities for a particular
target
Step2: Compute total polarity of review
a. REPEAT step 1 for all sentences
b. ADD all polarity values to calculate PR.
c. COMPARE with threshold
Case a: If PR > threshold, then R is positive.
Case b: If PR < threshold, then R is negative.
Output: Classification of review as positive or negative.
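The two steps can be sketched in a few lines. This is an illustration under stated assumptions: SentiUnit polarities are taken as already computed numbers, the grouping by target in Step 1 is collapsed into one sum per sentence, and the default threshold of 0.0 and the "undecided" label for PR equal to the threshold are choices made here, since the algorithm above does not cover that case.

```python
def sentence_polarity(sentiunit_polarities):
    """Step 1: sum the polarities of all SentiUnits in one sentence."""
    return sum(sentiunit_polarities)

def classify_review(sentences, threshold=0.0):
    """Step 2: sum sentence polarities into PR and compare with the threshold."""
    pr = sum(sentence_polarity(units) for units in sentences)
    if pr > threshold:
        return "positive"
    if pr < threshold:
        return "negative"
    return "undecided"  # PR == threshold is not covered by the algorithm above

# Each inner list holds assumed SentiUnit polarities for one sentence;
# an empty list stands for an objective sentence.
review = [[1.0], [-1.0, -2.0], []]
print(classify_review(review))  # negative
```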
5.5. Computation of SentiUnit polarity: Effect of polarity shifters
For the purpose of sentiment classification, the classifier is integrated with the lexicon of
annotated words (discussed in Section 5.6). In such a lexicon, a polarity mark is
annotated with each lexical entry to show its orientation and intensity. This is called the
prior polarity of the subjective words and phrases. The overall orientation of a sentence
is calculated by recognizing the prior polarities of the constituent subjective terms. This
idea works well in some simple sentences, particularly, if the polarity shifters are not
present. The polarity shifters are the words and phrases, which can change the prior
polarities of the words in a sentence.
Example:
Consider the sentence:
“‫”ﻣﯿﺮا ﮐﯿﻤﺮا ﮐﻢ ﻗﯿﻤﺖ ﮨﮯ ﻟﯿﮑﻦ اﺳﮑﯽ ﺑﯿﭩﺮی دﯾﺮﭘﺎ ﻧﮩﯿﮟ‬
mera camera kam-qeemat hay laykin iski battery derpa naheen
My camera is inexpensive but its battery is not long lasting
Description:
In this sentence, the word “دﯾﺮﭘﺎ” (derpa, long lasting) has a positive prior polarity, but
due to the polarity shifter “ﻧﮩﯿﮟ” (naheen, not), its overall contribution to the sentence’s
sentiment becomes negative. Another polarity shifter in the above expression is the word
“ﻟﯿﮑﻦ” (laykin, but), which alters the positive prior polarity of the word “ﮐﻢ ﻗﯿﻤﺖ”
(kam-qeemat, inexpensive). The overall polarity of the appraisal expression is called the
SentiUnit polarity. Therefore, our approach to sentiment classification rests on two types
of polarity scores:
• Prior polarity: Polarity marks annotated with the lexicon entries.
• SentiUnit polarity: The overall polarity of the appraisal expression, on which the final
polarity of the sentence depends.
At the highest level, our lexicon model categorizes all the lexical entries into objective
terms and subjective terms. Objective terms have no orientation or intensity and
hence are not marked with prior polarity scores. Therefore, they have no effect
on the overall classification decision. On the contrary, subjective terms are the
carriers of the sentiments and are marked with polarity scores. Their occurrence can
affect or even altogether alter the final classification decision.
Figure 5.6 Sentiment classification of a review as positive or negative.
The algorithm identifies the subjective words according to the prior polarities, annotated
in the lexicon. Then, it attaches the polarity shifters, conjunctions, postpositions and
modifiers to extract the appraisal expressions in the opinionated sentences. These
expressions are labeled as the SentiUnits. The shallow parsing based chunking is applied
for the extraction of the SentiUnits, with adjectives as the head words. The overall
polarity of a sentence in a given review can be determined by computing the polarity of
these expressions. Let us denote the term’s prior polarity with Tp, the SentiUnit’s polarity
with SUp, the sentence polarity with Sp, and the overall review polarity with Rp, as shown
in Figures 5.6 and 5.7.
5.5.1. Computing overall review polarity Rp from SUp
Figure 5.7 shows the overall process of the review polarity calculation.
Figure 5.7 Computation of the overall polarity of the Urdu text based review (Syed et al 2011 a)
When the system is given a review for classification, it sets the review polarity Rp and
the sentence count SCount to zero. Then, it takes the sentences one by one. The analysis
begins with text normalization, followed by word segmentation. The resulting words are
passed to the SentiUnit extraction and polarity computation module, which returns
polarity-annotated SentiUnits. The sentence polarity Sp is then computed from the
polarities of its constituent SentiUnits. The total Rp is the sum of all known sentence
polarities Sp. Finally, Rp is compared with the threshold value: if Rp is greater than the
threshold, the review is positive, and if it is less than the threshold, the review is
negative.
5.6. Sentiment Annotated Lexicon
Natural language processing applications use electronic, i.e., machine readable, versions
of lexicons. Processing at the lexical level may require a lexicon, and the particular
approach adopted by the system decides whether a lexicon will be employed, as well as
the extent, nature, and level of information that is encoded in that lexicon.
Lexicons may be relatively simple, with only the words and their lexical category (part of
speech), or may be increasingly intricate and include information about the semantic
classes of the word, its arguments, the semantic limitations on these arguments,
definitions of the sense or senses in the semantic representation employed in a certain
system, and it can even hold each sense of a single word for word sense disambiguation.
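As a sketch of what a relatively simple, sentiment-annotated entry might look like, the record below stores a word with its lexical category, orientation, and intensity. The field names, the numeric scale, and the definition of prior polarity as orientation times intensity are assumptions for illustration, not the lexicon structure defined in this chapter.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class LexiconEntry:
    word: str         # romanized here for readability
    pos: str          # lexical category (part of speech)
    orientation: int  # +1 positive, -1 negative, 0 objective
    intensity: float  # strength of the orientation

    @property
    def prior_polarity(self):
        """Signed score combining orientation and intensity (assumed definition)."""
        return self.orientation * self.intensity

lexicon = {e.word: e for e in [
    LexiconEntry("acha", "ADJ", +1, 1.0),
    LexiconEntry("behtareen", "ADJ", +1, 3.0),
    LexiconEntry("kitab", "NOUN", 0, 0.0),
]}
print(lexicon["behtareen"].prior_polarity)  # 3.0
```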
A usual model of a sentiment analyzer with a sentiment-annotated lexicon incorporates
two components:
(i) The classification model, which analyzes and classifies the given opinionated text
according to inherent sentiments of the reviewer (given in previous sections), and
(ii) The lexicon or lexicons annotated with the prior polarities of the lexical entries
(words/ phrases), usually as positive or negative.
These prior polarity annotated lexicons are also called sentiment-annotated lexicons
(Pang and Lee, 2008). They can be manually compiled, like the General Inquirer (Stone et al.
1966), a prominent resource used in English sentiment analysis research and
applications. Alternatively, such lexicons can be automatically generated. A considerable
body of research on sentiment-annotated lexicon construction has emerged within a few
years, for example, (Annett and Kondrak, 2008), (Higashinaka et al., 2007),
(Andreevskaia and Bergler, 2006), (Hu and Lui, 2005), (Yu and Hatzivassiloglou, 2003),
(Riloff et al., 2003), (Turney, 2002), and (Hatzivassiloglou and Wiebe, 2000). These
contributions have proposed a variety of approaches to lexicon development, lexicon
structure, and the relationships between the entries.
For English, corpora of reviews and other kinds of information are readily available on
product, movie, news, and discussion websites. These readily available texts have two
benefits: firstly, they can be used as test beds to analyze the performance of any kind of
sentiment analyzer, and secondly, they are very helpful in the automatic generation of
domain specific prior polarity lexicons.
On the contrary, Urdu is a resource poor language (Muscand and Ghosh, 2010). Most of
the data available in the Urdu language is in image formats [Hussain] or is not
suitable for sentiment analysis, because for the generation of a prior polarity lexicon we
need opinionated texts, like reviews.
Therefore, the development of a domain-specific prior polarity lexicon for Urdu text
poses many challenges. To our knowledge, no prior polarity lexicon of Urdu words is
accessible. Some contributions have, however, developed simple lexicons suitable for
other language processing applications of Urdu text, for example, (Ijaz and Hussain,
2007), (Humanyoun et al., 2007), (Muaz and Hussain, 2009) and (Muscand and Ghosh, 2010).
5.6.1. Definitions of the specific terms

Before presenting our model of the lexicon of Urdu words we consider it mandatory to
define certain terms, like lexicon, lexeme, lemma etc.
Definition 1:
In linguistics, a lexicon is defined as the set of all the morphemes of a particular language.
More specifically, it can be a collection of terms used in a particular profession, subject,
or style, i.e., a vocabulary, such as the lexicon of Greek mythology. More formally, it is a
language's inventory of lexemes.
Definition 2:
A lexeme is a conceptual unit of the morphological analysis, which corresponds to a set
of the forms taken by a single word. Generally, a lexeme belongs to a specific syntactic
class and has a definite semantic value. In case of inflecting languages (such as Arabic,
Urdu, Turkish etc), it has a related inflectional paradigm, so, a lexeme in many languages
will have many different forms.
Example:
As an example, consider the lexeme WALK from the English language; this lexeme has
different forms, i.e., walk, walks, walked and walking.
The grammar rules of a language, which include compound tense rules and subject-verb
agreement, govern the forms of the lexemes. For example, walks is the present third
person singular form of the lexeme WALK, whereas walked is its past form.
Definition 3:
Morphology (defined and discussed in Chapter 3) is also based on the notion of the
lexeme, in terms of which many other concepts can be described. For example, the
morphological operations (already defined in Chapter 3) can be stated in terms of
lexemes as follows: inflectional rules relate a lexeme to its forms, and derivational
rules relate a lexeme to another lexeme.
Definition 4:
In dictionaries, conventionally, a lexeme is presented as the lemma, which is a canonical
form of a lexeme and is used as the headword. Other forms of the lexeme that are not
common conjugations of the word are often listed later in the lexical entry.
Definition 5:
A lexical entry is a single word or a chain of words that forms a basic element of a
lexicon. Examples of single-word lexical entries are lion, computer and finger, whereas
traffic signal, life style, bits and pieces, and take care of are examples of chains of
words.
Much like a lexeme, a lexical entry generally expresses a distinct meaning, but it is not
limited to a single word.
5.6.2. Sentiment annotated lexicon of Urdu words

Our approach distinguishes clearly between the subjective and objective entries in the
lexicon. We take two attributes of a subjective entry, i.e., its orientation (either positive
or negative) and its intensity (the force of the orientation).
After developing the lexicon we integrate it with the sentiment classifier. The
classifier preprocesses the given text and then applies shallow-parsing-based chunking. It
compares all the words and phrases in the text against the lexicon, so that all the
subjective terms in the text become annotated. On the basis of the polarities of the
individual words, the polarity of each sentence and then of the whole review is calculated.
We evaluate the overall system on a corpus of Urdu movie reviews: the classification
algorithm is applied to the review corpus, and each subjective word in a review is
compared with the lexicon entries to compute the polarity scores.
i) Construction Steps
We divide the lexicon construction task into the following steps:
• Categorize the words as either subjective or objective. These are the two categories
of lexicon entries that we have identified. When the classification algorithm is applied,
the classifier simply ignores the objective terms, so its performance depends entirely
on the subjective words. For example, “موم” (mome, wax) is an objective word and
“عمدہ” (umdah, fine) is a subjective word.
• Categorize these words according to the morphological rules, which work at the word
level. This categorization helps in identifying the subjective terms in the given
text. An example is the rules for marking an adjective in agreement with the noun it
qualifies.
• Identify their grammatical rules, which describe the possible structures of a sentence
and the position of the parts of speech with respect to each other. Examples are the use
of modifiers with adjectives and the use of auxiliaries with verbs.
• Discover the relationships between different lexicon entries. These relationships can
define synonyms, antonyms, cross references, etc.
• Decide the polarities and then the intensities of the entries and annotate them. In this
task the entries are first categorized as positive or negative, and then their intensity
scores are attached. Some entries have only an orientation, some have only an intensity
(like modifiers) and some have both values.
ii) Lexicon Structure

The model is designed to distinguish between the objective and subjective terms in a
given review. Objective terms carry neutral sentiment and have no effect on the
final classification decision, whereas subjective terms are the carriers of
sentiment and their presence can alter the final classification. In keeping with this
distinction, our lexicon entries are categorized as objective and subjective terms.
Before going into the details we define some terms according to our approach:
Orientation: Orientation describes either the positivity or the negativity of a lexicon
entry. For most entries the orientation is predefined during the lexicon construction phase.
In a given text, however, it can be altered by a polarity shifter in the sentence, e.g.,
the word “اچھا” (acha, good) has a positive orientation, but with the polarity shifter
“نہیں” (naheen, not) it becomes a negative expression, i.e., “اچھا نہیں” (acha naheen, not good).
Moreover, the orientation of a few words is highly domain specific or depends upon the
context within which they are used. These two issues, domain specificity and context
dependence, are beyond the scope of our research.
Intensity: The intensity of the orientation of a lexicon entry, i.e., the force of the
positivity or negativity of a term. Usually the modifiers, e.g., “بہت” (bohat, very),
describe the intensity of an expression. As in other languages, there are three degrees
of intensity in Urdu: absolute (only a positive or negative orientation), comparative (two
distinct entities are compared with each other) and superlative (one entity has the
highest orientation among all).
Polarity: The polarity mark is annotated to each lexicon entry to show its orientation and
intensity.
iii) Classification of the terms

The objective terms are saved without any polarity mark, but the subjective terms are
further categorized on the basis of orientation and intensity into three types, as shown in
Figure 5.8.
a) Absolute subjective terms with orientation only, T(O): For such terms there are only
two possible values, i.e., “+1” for absolute positive and “-1” for absolute negative.
Examples: Absolute Urdu adjectives fall in this category, e.g., the adjectives
“خوبصورت” (khoobsurat, beautiful) and “بہادر” (bahadur, brave) both have a positive
orientation and are marked with prior polarity = +1, whereas “گھٹیا” (ghatya, cheap) has
prior polarity = -1 due to its negative orientation.
Figure 5.8 Structure of the sentiment annotated lexicon (Syed et al 2011 c).
b) Subjective terms with intensity only, T(I): For such terms the prior polarity is assigned
according to the possible intensity values, which show the degree of the polarity, i.e., 1
for absolute, 2 for comparative and 3 for superlative.
Examples: The adjective modifiers are essentially terms with intensity only, e.g., both
the modifiers “بہت” (bohat, much) and “زیادہ” (zyadah, more) have prior polarity = 2,
and “سب سے زیادہ” (sab say zyadah, most) has prior polarity = 3.
c) Subjective terms with both orientation and intensity, T(I,O): In this case the
prior polarity is calculated by multiplying the orientation score (+1 or -1) by the
intensity score (1, 2 or 3).
Examples: In Urdu very few terms fall in this category. For example, the
words “بہتر” (behtar, better), “بہترین” (behtareen, best) and “بدتر” (badtar, worse) have
prior polarities = +2, +3 and -2, respectively. These are usually Persian loan words.
iv) Lexicon Entries

Some examples of lexicon entries from all three categories, i.e., T(O), T(I) and T(I,O),
are given in Table 5.1. For example, the word “کامیاب” (kamyaab, successful) has a
positive orientation but no intensity. Similarly, “زیادہ” (zyada, more) and “بہت” (bohat,
very) both have an intensity and no orientation, whereas “بہتر” (behtar, better) and
“بہترین” (behtareen, best) both have a positive orientation with the intensities of the
comparative and superlative degrees, respectively.
Examples of T(O) | Examples of T(I) | Examples of T(I,O)
“کامیاب” (kamyaab, successful) | “بہت” (bohat, very) | “بہترین” (behtareen, best)
“خوبصورت” (khoobsurat, beautiful) | “زیادہ” (zyada, more) | “بدترین” (badtareen, worst)
“بہادر” (bahadur, brave) | “شدید” (shadeed, extremely) | “بہتر” (behtar, better)
Table 5.1 Examples of lexicon entries.
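The three categories and the prior polarity rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class name `LexiconEntry`, its fields and the romanized keys are hypothetical and are used for readability; they are not the storage format of the actual lexicon.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    """One subjective lexicon entry (hypothetical structure, for illustration)."""
    lemma: str                         # romanized head word, used only for readability
    orientation: Optional[int] = None  # +1 or -1, None if the term has no orientation
    intensity: Optional[int] = None    # 1, 2 or 3, None if the term has no intensity

    def category(self) -> str:
        """Return T(O), T(I) or T(I,O) depending on which attributes are set."""
        if self.orientation is not None and self.intensity is not None:
            return "T(I,O)"
        if self.orientation is not None:
            return "T(O)"
        if self.intensity is not None:
            return "T(I)"
        raise ValueError("a subjective entry needs an orientation or an intensity")

    def prior_polarity(self) -> int:
        """Orientation for T(O), intensity for T(I), their product for T(I,O)."""
        kind = self.category()
        if kind == "T(O)":
            return self.orientation
        if kind == "T(I)":
            return self.intensity
        return self.orientation * self.intensity


# Entries taken from the examples in the text.
ENTRIES = [
    LexiconEntry("khoobsurat", orientation=+1),
    LexiconEntry("ghatya", orientation=-1),
    LexiconEntry("bohat", intensity=2),
    LexiconEntry("sab say zyadah", intensity=3),
    LexiconEntry("behtar", orientation=+1, intensity=2),
    LexiconEntry("behtareen", orientation=+1, intensity=3),
    LexiconEntry("badtar", orientation=-1, intensity=2),
]

for entry in ENTRIES:
    print(entry.lemma, entry.category(), entry.prior_polarity())
```

Keeping orientation and intensity as separate optional fields mirrors the distinction between the three term types; the prior polarity is then derived rather than stored.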
5.14. System integration
The annotated lexicon of Urdu words is integrated with the sentiment classifier as shown
in Figure 5.9. First, the given text, in the form of a review, is taken from the website.
The sentiment classifier component of the system preprocesses this review and segments it
into sentences and then into words. These words are then tagged with their respective
parts of speech. The tagged words are compared with the lexicon entries for sentiment
orientations and intensities. This comparison results in polarity-annotated words and
phrases.
[Figure 5.9 labels: Website; Given review in Urdu text (review); Sentiment Classifier; POS-tagged words/phrases; Sentiment-annotated lexicon of Urdu words; Polarity-annotated words/phrases; Classification]
Figure 5.9 Integration of the lexicon of Urdu words with the sentiment classifier (Syed et al 2010).
On the basis of the polarities of the individual words, the polarity of each sentence and
then of the whole review is calculated. We evaluate the overall system using a corpus of
movie reviews in the Urdu language; the experimentation is given in Chapter 6. The
classification algorithm is applied to the review corpus. Each subjective word in a review
is compared with the lexicon entries for the computation of the polarity scores.
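The comparison of tagged words with the lexicon can be pictured as a simple dictionary lookup. The sketch below is a minimal illustration, not the implementation of the system: the miniature lexicon, the function name `annotate`, the romanized keys and the POS tags of the sample sentence are all assumptions made for the example.

```python
# Hypothetical miniature lexicon: romanized word -> prior polarity.
# Objective words are simply absent, so the lookup ignores them.
LEXICON = {"umdah": 1, "acha": 1, "ghatya": -1, "behtareen": 3}


def annotate(tagged_words):
    """Attach a prior polarity to every (word, tag) pair found in the lexicon."""
    annotated = []
    for word, tag in tagged_words:
        polarity = LEXICON.get(word)  # None marks an objective (unannotated) word
        annotated.append((word, tag, polarity))
    return annotated


# "meri kitab umdah hay" (my book is good), with assumed POS tags.
sentence = [("meri", "PP"), ("kitab", "NN"), ("umdah", "ADJ"), ("hay", "AUX")]
result = annotate(sentence)

# Only the subjective words carry a polarity mark.
subjective = [(word, polarity) for word, _, polarity in result if polarity is not None]
print(subjective)
```

Because objective terms are not stored with a polarity, they drop out of the computation automatically, which matches the statement that the classifier ignores them.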
Chapter review:
This chapter has presented our approach in detail, as well as the modules of the system:
PREPROCESSOR, EXTRACTOR, ASSOCIATOR, and CLASSIFIER (see Figure 6.1).
Each module is described by a separate detailed model. In the next chapter, we evaluate
our model through experimentation.
As a pioneering effort, in this research we describe the structure, construction and
evaluation of a manually tagged, sentiment-annotated lexicon of Urdu words as a
component of a sentiment analysis model developed for Urdu text. The lexicon contains
information about the subjectivity of an entry in addition to its orthographic,
phonological, syntactic and morphological aspects. Our approach distinguishes clearly
between the subjective and objective entries in the lexicon. We take two attributes of a
subjective entry, i.e., its orientation (either positive or negative) and its intensity
(the force of the orientation).
After developing the lexicon we integrate it with the sentiment classifier. The
classifier preprocesses the given text and then applies shallow-parsing-based chunking. It
compares all the words and phrases in the text against the lexicon, so that all the
subjective terms in the text become annotated. The classifier then calculates the
sentiment orientation of the sentences and then of the overall review.
CHAPTER 6
EXPERIMENTATION AND RESULTS
The evaluation of sentiment classifiers is typically conducted experimentally rather
than analytically. The reason is that analytical evaluation requires a formal
specification of the problem, including definitions of correctness and completeness,
and it does not address practical effectiveness and performance.
The experimental evaluation of a classifier, on the other hand, usually measures its
effectiveness in terms of its ability to take accurate classification decisions. Hence,
we have performed a series of experiments. These experiments compare two versions of
the system, Model A and Model B, and analyze the effect of polarity shifters and
negations. The results are given in Section 6.4. The lexicon and corpora are discussed in
Sections 6.1 and 6.2 respectively. Moreover, Section 6.3 gives case studies that
illustrate the processing of the major components or modules of the system. Some
example illustrations are given in Section 6.5.
6.1. Lexicon Coverage

The current version of the lexicon contains 1,368 adjectives, which are marked according
to their orientation and intensity. These adjectives are picked from all the classes given
in Chapter 4. As already mentioned, Urdu adjectives are marked with case through
inflection and derivation; therefore, all inflected forms are covered by a single entry.
Moreover, there are 67 modifiers, including both comparative and superlative intensity
levels. The 1,920 nominal head words are selected from the domains of movies and
electronic appliances. A summary of the existing version of the lexicon is given in
Table 6.1.
Modifiers | Adjectival head words | Nominal head words
67 | 1,368 | 1,920
Table 6.1. Summary of lexicon entries.
6.2. Corpus
Because no corpus of Urdu reviews is publicly accessible, we collect two corpora of
reviews to evaluate the efficacy of the employed model. The first corpus, C1, is a
collection of 700 movie reviews, of which 385 are positive and 315 are negative. The
average document length in this corpus is 264 words. To obtain varied reviews, 40
different movies with different (already known) popularity scores and categories
(comedy, drama, historical, etc.) are given for review.
The second test bed, C2, is a corpus of reviews of electronic appliances. This corpus
comprises a total of 650 reviews, 322 positive and 328 negative. The base collection
has reviews of three types of appliance: refrigerators (237), air-conditioners (250), and
televisions (163). The average review length is 196 words. To achieve diversity, 9
different brands of electronic appliances are given for review.
For both corpora, the reviews within the threshold boundary or with neutral scores are
removed. Hence, the data set contains only positive and negative reviews, as shown in
Table 6.2.
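Domain | Total number | Average length | Orientation | Number
Movies (C1) | 700 | 264 words | Positive | 385
Movies (C1) | 700 | 264 words | Negative | 315
Electronic appliances (C2) | 650 | 196 words | Positive | 322
Electronic appliances (C2) | 650 | 196 words | Negative | 328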
Table 6.2. Corpora for evaluation.
6.3. Case Studies
Before proceeding to the results, we consider different case studies from Urdu text
and show how system processes such as POS tagging, extraction, association and
polarity annotation are performed.
6.3.1. CASE 1 Parts of speech tagging
Consider an Urdu verse as an example for POS tagging.
میں اکیلا ہی چلا تھا جانب منزل مگر لوگ ساتھ آتے گئے اور کارواں بنتا گیا۔
(main akela hi chala tha jaanib-e-manzil magar, log saath aate gaye aur kaaravaan bantaa
gayaa, I had started all alone towards the destination, but people kept joining and it
became a caravan.)
POS tagging results in the following tag assignments (each word is followed by its tag):
میں <PP> | اکیلا <ADJ> | ہی <ADV> | چلا <VB> | تھا <TA> | جانب <ADJ> | منزل <N> | مگر <SC>
لوگ <NN> | ساتھ <NN> | آتے <VB> | گئے <VB> | اور <CJC> | کارواں <NN> | بنتا <VB> | گیا <VB> | - <SM>
6.3.2. CASE 2 Extraction of targets and SentiUnits

Following examples describe the execution of the EXTRACTOR module.
Example 1:
“‫”ارﺗﻀﯽ ﮐﺎ روﺑﻮٹ ﺑﮍا ﺷﺎﻧﺪار ﮨﮯ‬
Irtaza ka robot bara shaandaar hay
Irtaza’s robot is very fabulous.
In this sentence, both the SentiUnit and the target are complex, i.e., they are composed of
more than one word. The SentiUnit “بڑا شاندار” (bara shaandaar, very fabulous) is made of
an adjective head word and a positive modifier. The target of the comment,
“ارتضی کا روبوٹ” (Irtaza ka robot, Irtaza’s robot), consists of three words: two nouns with a
possession marker in between, as shown in Table 6.3.
Remark | Parse | Text
Sentence with complex SentiUnit (SU) and target (NP) | [N PM N] [ADJ ADJ] AUX → NP SU AUX | ارتضی کا روبوٹ بڑا شاندار ہے
Noun phrase (NP) with possession marker (PM) | N PM N → NP (Target) | ارتضی کا روبوٹ
SentiUnit made by two adjectives (ADJ) | ADJ ADJ → SU (SentiUnit) | بڑا شاندار
Table 6.3.Parsing of example 1 into targets and SentiUnits.
Example 2:
“‫”ارﺗﻀﯽ اورﻓﺎطﻤہ ﮐﺎ ﮐﻤﺮه ﮨﻮادارﻧﮩﯿﮟ‬
Irtaza aur Fatima ka kamrah hawadar naheen
Irtaza and Fatima’s room is not airy.
Again, both the SentiUnit and the target are complex. The SentiUnit “ہوادار نہیں” (hawadar
naheen, not airy) contains an adjective head and a negation word. The target of the
comment is even more complex: “ارتضی اور فاطمہ کا کمرہ” (Irtaza aur Fatima ka
kamrah, Irtaza and Fatima’s room) consists of five words: three nouns, a possession
marker and a conjunction. The sentence parse is given in Table 6.4.
Remark | Parse | Text
Sentence with complex SentiUnit and target | [N CJC N PM N] [ADJ NEG] → NP SU | ارتضی اور فاطمہ کا کمرہ ہوادار نہیں
Noun phrase with conjunction (CJC) and possession marker (PM) | N CJC N PM N → NP (Target) | ارتضی اور فاطمہ کا کمرہ
SentiUnit with negation (NEG) | ADJ NEG → SU (SentiUnit) | ہوادار نہیں
Table 6.4.Parsing of example 2 into targets and SentiUnits.
Example 3:
Here is a short review from the Urdu movie review corpus.
“فلم کی کہانی بورنگ ہے۔ ہیرو کی اداکاری اور شکل اچھی نہیں۔ نا ہی ہدایت کاری قابل ستایش ہے۔”
film ki kahani boring hay. hero ki adakari aur shakal achi naheen. na hi hidayatkari
qabil-e-staayesh hay.
The story of the film is boring. The hero’s acting and looks are not good. Nor is
the direction appreciable.
Description:
The review consists of three comments. In the first comment no negation particle is used;
the SentiUnit is an adjective, “بورنگ” (boring, boring), with negative polarity,
resulting in an overall negative impact. In the second comment the SentiUnit is made of
a positive adjective, “اچھی” (achi, good), and a negation mark, “نہیں” (naheen, not),
which reverses the effect of the adjective to give an overall negative impact. The
SentiUnit in the third comment is again based on a positive term, “قابل ستایش”
(qabil-e-staayesh, appreciable), and a negation mark, “نا” (na, nor), which appears at the
beginning of the comment; hence, it conveys a negative opinion.
So, the overall polarity of the review is negative. The POS tagging and phrase chunking
of the review are given in Table 6.5, which lists each comment, its POS tags, and the
resulting noun phrases (NP) and SentiUnits (SU).
Comments | POS tags | Phrases & SU
فلم کی کہانی بورنگ ہے | [N + PM + N] [ADJ] | [NP] [SU]
ہیرو کی اداکاری اور شکل اچھی نہیں | [N + PM + N] [CJC] [N] [ADJ + NEG] | [NP] [CJC] [NP] [SU]
نا ہی ہدایت کاری قابل ستایش ہے | [NEG] [N] [ADJ] | [NP] [SU]
Table 6.5.POS tagging and phrase chunking of the given review.
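The step from comment polarities to the overall review polarity in Example 3 can be sketched as follows. The text does not spell out the aggregation formula at this point, so summing the SentiUnit polarities per comment and then over the review is an assumption made for illustration; the function names are hypothetical.

```python
def label(score):
    """Map a numeric polarity score to an orientation label."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def review_polarity(comments):
    """Sum SentiUnit polarities per comment, then sum the comment scores.

    Summation is an assumed aggregation rule, not one stated in the text.
    """
    comment_scores = [sum(units) for units in comments]
    return comment_scores, sum(comment_scores)


# Example 3: one SentiUnit per comment.
# "boring" -> -1; "achi naheen" -> (+1)(-1) = -1; "na ... qabil-e-staayesh" -> (+1)(-1) = -1
review = [[-1], [-1], [-1]]
comment_scores, total = review_polarity(review)
print([label(s) for s in comment_scores], label(total))
```

All three comments are negative, so any reasonable aggregation rule (sum, majority vote) would give the same negative result for this review.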
Example 4: Let us take an example execution of a single sentence. Figure 6.1 shows the
execution steps in detail:
“گاڑی کا یہ ماڈل خوب صورت نہیں-”
gari ka yeh model khoobsurat naheen hay
This model of the car is not beautiful.
Input: گاڑی کا یہ ماڈل خوب صورت نہیں-
Preprocessing:
  Cleaned text: گاڑی کا یہ ماڈل خوب صورت نہیں
  Tokens: گاڑی | کا | یہ | ماڈل | خوب صورت | نہیں
  POS tags: <N> <POSS> <DT> <N> <ADJ> <NOT>
Shallow Parsing:
  Chunks: گاڑی کا | یہ ماڈل | خوب صورت | نہیں
  Chunk tags: <NP> <NP> <ADJ> <NOT>
  SentiUnit extraction: گاڑی کا | یہ ماڈل | خوب صورت نہیں
  Chunk tags: <NP> <NP> <SentiUnit>
Classification:
  گاڑی کا | یہ ماڈل | خوب صورت نہیں
  <NP> <NP> <Negative orientation>
Result: This is a negative comment.
Figure 6.1.Example extraction of the SentiUnits.
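The shallow-parsing step of Figure 6.1 can be approximated by greedy pattern matching over the POS tags. The sketch below is an illustration only: the pattern table, the function name `chunk` and the romanized tokens are assumptions, and the figure's two chunking steps are collapsed into a single pass.

```python
# Hypothetical chunk patterns over POS tags, tried in order at each position.
PATTERNS = [
    (("ADJ", "NOT"), "SentiUnit"),  # adjective followed by a negation word
    (("N", "POSS"), "NP"),          # noun with its possession marker
    (("DT", "N"), "NP"),            # determiner plus noun
    (("ADJ",), "SentiUnit"),        # bare adjective
    (("N",), "NP"),                 # bare noun
]


def chunk(tagged):
    """Group (word, tag) pairs into NP and SentiUnit chunks, left to right."""
    chunks, i = [], 0
    while i < len(tagged):
        for pattern, chunk_label in PATTERNS:
            window = tuple(tag for _, tag in tagged[i:i + len(pattern)])
            if window == pattern:
                words = " ".join(word for word, _ in tagged[i:i + len(pattern)])
                chunks.append((chunk_label, words))
                i += len(pattern)
                break
        else:
            # No pattern matched: keep the word as its own chunk under its POS tag.
            chunks.append((tagged[i][1], tagged[i][0]))
            i += 1
    return chunks


# Tokens and tags of Example 4, romanized for readability.
tagged = [("gari", "N"), ("ka", "POSS"), ("yeh", "DT"),
          ("model", "N"), ("khoobsurat", "ADJ"), ("naheen", "NOT")]
print(chunk(tagged))
```

Listing the longer patterns before the single-tag ones makes the negation word attach to its adjective before the bare-adjective rule can fire.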
6.3.3. CASE 3 Case marking and complex noun phrases

Consider a sentence with a complex noun phrase, where the same sentence with the same
meaning is written in three different versions.
Version 1:
نہیں تیرا نشیمن کثر سلطانی کے گنبد پر
naheen tera nasheman kasr-e sultani kay gunbad par
Your home is not on the tower of the king’s palace.
Description:
In this sentence there are three noun phrases. One of them is complex, i.e., “کثر سلطانی”
(kasr-e sultani, king’s palace). In the English translation the apostrophe is used in place
of “of”. In Urdu, however, no indication is visible, because the diacritic mark is optional
and mostly omitted. Only native Urdu readers can infer the right pronunciation and
meaning. This phenomenon is called compounding and is very common in Urdu text
(discussed in Chapter 3).
<NEG> نہیں
NP 1: <PP> تیرا <NN> نشیمن (NP 1 consists of one noun and one adjective.)
NP 2: <NN> کثر <ADJ> سلطانی (NP 2 is a compound of two nouns joined through the diacritic.)
کے
NP 3: <NN> گنبد (NP 3 is simple, with a single noun.)
پر
Version 2:
Let us consider another version of the same sentence:
تیرا نشیمن کثر سلطانی کے گنبد پر نہیں
tera nasheman kasr-e sultani kay gunbad par naheen
Description:
In this sentence only the word order is changed; the composition of the noun phrases
remains the same.
NP 2: <NN>‫ﮐﺜﺮ‬
‫ﮐﮯ‬
NP 3: <NN>‫ﮔﻨﺒﺪ‬, NP 3 is simple with single noun.
‫ﭘﺮ‬
Version 3:
Another version of the sentence is
تیرا نشیمن گنبد کثر سلطانی پر نہیں
tera nasheman gunbad-e kasr-e sultani par naheen
Description:
In this case the word order is changed and “کے” (kay, of) is replaced by the diacritic,
making “گنبد” (gunbad, tower) an additional word in the noun phrase, i.e., “گنبد کثر سلطانی”
(gunbad-e kasr-e sultani, king’s palace’s tower). Therefore, the sentence contains two
noun phrases.
NP 2: <NN>‫ﮔﻨﺒﺪ‬
<NN>‫ﮐﺜﺮ‬
‫ﭘﺮ‬
6.3.4. CASE 4 Polarity annotations

Example 1:
“‫”ﻣﯿﺮی ﮐﺘﺎب ﻋﻤﺪه ﮨﮯ‬
meri kitab umdah hay
My book is good.
Description:
The SentiUnit consists of an adjective, a subjective term with orientation only.
Hence,
SU_p = T_p        (1)
where
T_p = +1
Putting this value in equation 1, we get
SU_p = +1
Thus, the SentiUnit polarity is “+1”.
Example 2:
“‫”ﻣﯿﺮی ﮐﺘﺎب ﻋﻤﺪه ﻧﮩﯿﮟ ﮨﮯ‬
meri kitab umdah naheen hay
My book is not good.
Description:
The SentiUnit consists of an adjective, a subjective term with orientation only, and a
negation term as the polarity shifter.
Hence,
SU_p = (T_p)(Neg)        (2)
where
T_p = +1 and Neg = -1
Putting these values in equation 2, we get
SU_p = (+1)(-1) = -1
Thus, the SentiUnit polarity is negative, i.e., “-1”.
Example 3:
“‫”وه ﺳﺐ ﺳﮯ زﯾﺎده ﺳﺨﯽ ﮨﮯ‬
woh sab say zyadah sakhee hay
He is the most generous of all.
Description:
The SentiUnit consists of four lexical units: an adjective, a subjective term with
orientation only, and a superlative modifier made of three words. The adjective polarity
shifts to the superlative degree due to the intensity of the modifier.
Hence,
SU_p = (T_p1)(T_p2)        (3)
where
T_p1 = +1 and T_p2 = 3
From equation 3, we get
SU_p = (+1)(3) = +3
This results in a positive SentiUnit polarity of intensity “3”.
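Equations 1 to 3 share one pattern: the SentiUnit polarity is the product of the prior polarities of its terms, multiplied by -1 when a negation term is present. A minimal Python sketch of that pattern follows; the function name `sentiunit_polarity` and its parameters are hypothetical.

```python
from math import prod

NEG = -1  # value of a negation term acting as a polarity shifter (equation 2)


def sentiunit_polarity(term_polarities, negated=False):
    """Multiply the prior polarities of the terms; apply NEG if the unit is negated."""
    polarity = prod(term_polarities)
    return polarity * NEG if negated else polarity


print(sentiunit_polarity([+1]))                # Example 1: "umdah"
print(sentiunit_polarity([+1], negated=True))  # Example 2: "umdah naheen"
print(sentiunit_polarity([+1, 3]))             # Example 3: "sab say zyadah sakhee"
```

The three calls correspond to Examples 1, 2 and 3 and yield +1, -1 and +3, as derived above.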
6.3.5. CASE 5 Associating targets with SentiUnits

The ASSOCIATOR module associates the SentiUnits with the targets. For example, take
the linkage specification shown below:
We apply it to the following sentence

“‫”ﺑﺎﻧﮓ درا اﯾﮏ اﭼﮭﯽ ﮐﺘﺎب ﮨﮯ‬
bang-e-dara aik achi kitab hay.
Baang-e-Dara is a good book.
The chunker finds “اچھی” (achi, good) as a sentiment expression. The ASSOCIATOR module
then searches for the target noun phrase, which is “بانگ درا” (bang-e-dara), the name of
the book, as shown in Figure 6.2.
Figure 6.2.Linking the sentiment expressions with candidate targets.
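The linkage specification itself is not reproduced in the text, so the following sketch does not implement it. It only illustrates the idea of association with one assumed rule: each SentiUnit is linked to the nearest noun phrase that precedes it. The function name `associate`, the chunk labels and the romanized chunks are hypothetical.

```python
def associate(chunks):
    """Link each SentiUnit (SU) to the nearest preceding noun phrase (NP).

    This nearest-preceding-NP rule is an assumption for illustration only.
    """
    links = []
    last_np = None
    for chunk_label, words in chunks:
        if chunk_label == "NP":
            last_np = words
        elif chunk_label == "SU":
            links.append((words, last_np))  # target is None if no NP came before
    return links


# "bang-e-dara aik achi kitab hay" (Baang-e-Dara is a good book), assumed chunking.
chunks = [("NP", "bang-e-dara"), ("DT", "aik"), ("SU", "achi"),
          ("NP", "kitab"), ("AUX", "hay")]
print(associate(chunks))
```

For this sentence the assumed rule selects the book title as the target, in agreement with Figure 6.2; sentences with several candidate targets would need the actual linkage specification.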
6.4. Results
Accuracy alone is not a sufficient performance metric for evaluating the effectiveness
and efficiency of a text classifier. Therefore, in addition to the accuracy A, we use three
other metrics: the precision P, the recall R and the F-measure F. These metrics
provide much greater insight into the performance of a classifier.
Definition 1: For a sentiment classifier, the accuracy A is the measure of
how close the document classification suggested by the classifier is to the actual
sentiments present in the review.
Definition 2: The precision P measures the exactness of a classifier. A higher P means
fewer false positives and vice versa. In terms of true positives tp, false positives fp,
true negatives tn and false negatives fn, P is defined as:
P = tp / (tp + fp)
Definition 3: The recall R measures the sensitivity or completeness of the classifier.
A higher R means fewer false negatives and vice versa. In terms of tp and fn, R is
defined as:
R = tp / (tp + fn)
Definition 4: The F-measure F combines precision and recall as the harmonic mean of
both values, as defined below:
F = 2PR / (P + R)
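The definitions above translate directly into code. In the sketch below the function names and the confusion-matrix counts are hypothetical, and the accuracy formula is the standard one, (tp + tn) / (tp + tn + fp + fn), which the text describes only in words.

```python
def precision(tp, fp):
    """P = tp / (tp + fp)"""
    return tp / (tp + fp)


def recall(tp, fn):
    """R = tp / (tp + fn)"""
    return tp / (tp + fn)


def f_measure(p, r):
    """F = 2PR / (P + R), the harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def accuracy(tp, tn, fp, fn):
    """Standard definition: the share of all decisions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts, chosen only to exercise the formulas.
tp, fp, tn, fn = 80, 20, 90, 10
p, r = precision(tp, fp), recall(tp, fn)
print(round(p, 3), round(r, 3), round(f_measure(p, r), 3), round(accuracy(tp, tn, fp, fn), 3))
```

Because F is a harmonic mean, it stays close to the smaller of P and R, so a classifier cannot score well on F by excelling in only one of the two.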
A series of four experiments, in two sets, has been performed with two models of the
system. Model A is the former version of the system with the EXTRACTOR
module only (Syed et al. 2010), and model B is the final version, to which the
ASSOCIATOR module is attached (Syed et al. 2012). This setup allows the efficacy
and usability of the extended version to be compared directly. Both models are applied to
both corpora, C1 and C2, separately.
6.4.1. Model A
Table 6.6 and Table 6.7 show the results of the experiments performed by model A on
both corpora C1 and C2. Table 6.6 shows the detailed results with P, R, F and A values
separately computed for positive as well as negative reviews.
Orientation | Corpus | Precision | Recall | F-measure (%) | Accuracy
Positive | C1 | 0.737 | 0.681 | 70.8 | 74%
Positive | C2 | 0.795 | 0.737 | 76.5 | 79%
Negative | C1 | 0.698 | 0.654 | 67.5 | 66%
Negative | C2 | 0.785 | 0.767 | 77.6 | 77%
Table 6.6. Experimental results in terms of P, R, F and A for model A (Syed et al. 2012)
Table 6.7 shows a comparative summary of the results from both corpora. The accuracy
of C1 is 70% and variation in positive and negative reviews is 8%. Whereas the accuracy
of C2 is 78% and variation in positive and negative reviews is 2%. The total accuracy of
model A is 74%.
Corpus | Orientation | Accuracy | Variation | Corpus accuracy | Total accuracy
C1 | Pos | 74% | 8% | 70% | 74%
C1 | Neg | 66% | 8% | 70% | 74%
C2 | Pos | 79% | 2% | 78% | 74%
C2 | Neg | 77% | 2% | 78% | 74%
Table 6.7. Comparison of accuracy from both corpora C1 and C2 for model A.
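The summary figures for model A can be reproduced with a few lines of Python. The text does not state how the corpus and total accuracies are derived; the sketch assumes unweighted means of the per-orientation accuracies, which is consistent with the reported numbers. The function name `summarize` is hypothetical.

```python
def summarize(per_corpus):
    """Compute variation and (assumed unweighted) mean accuracy per corpus and overall."""
    summary = {}
    for corpus, (pos, neg) in per_corpus.items():
        summary[corpus] = {
            "variation": abs(pos - neg),   # gap between positive and negative accuracy
            "accuracy": (pos + neg) / 2,   # assumption: simple mean of the two
        }
    total = sum(row["accuracy"] for row in summary.values()) / len(summary)
    return summary, total


# Accuracy (%) of model A for positive and negative reviews, from Table 6.6.
model_a = {"C1": (74, 66), "C2": (79, 77)}
summary, total = summarize(model_a)
print(summary, total)
```

Under this assumption the sketch gives 70% and 78% for C1 and C2, variations of 8% and 2%, and a total of 74%, matching the values reported for model A.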
6.4.2. Model B
For the next two experiments we include the ASSOCIATOR module and test both corpora.
The results are shown in Table 6.8 and Table 6.9. Table 6.8 shows the experimental
results in terms of P, R, F, and A for model B applied to C1 and C2, for positive and
negative reviews separately.
Orientation | Corpus | Precision | Recall | F-measure (%) | Accuracy
Positive | C1 | 0.822 | 0.795 | 80.8 | 80%
Positive | C2 | 0.897 | 0.877 | 88.7 | 88%
Negative | C1 | 0.795 | 0.777 | 78.6 | 77%
Negative | C2 | 0.865 | 0.832 | 84.8 | 84%
Table 6.8. Experimental results in terms of P, R, F and A for model B (Syed et al. 2012).
Results from Table 6.8 are compared and summarized in Table 6.9. The accuracy of C1 is
improved to 78.5%, and the variation in positive and negative reviews is decreased to
3%. Likewise, the accuracy of C2 is increased to 86.5%. In this case the variation in the
accuracy of positive and negative reviews is also increased to 3%. The total accuracy of
model B is 82.5%.
Corpus | Orientation | Accuracy | Variation | Corpus accuracy | Total accuracy
C1 | Pos | 80% | 3% | 78.5% | 82.5%
C1 | Neg | 77% | 3% | 78.5% | 82.5%
C2 | Pos | 88% | 3% | 86.5% | 82.5%
C2 | Neg | 85% | 3% | 86.5% | 82.5%
Table 6.9. Comparison of accuracy from both corpora C1 and C2 for model B.
Observations:
From the above results it is clear that the classification accuracy is highly domain
specific. The movie reviews in C1 are more challenging to classify than the reviews of
electronic appliances in C2. The reason is that movie reviews contain more allegory, which
results in more divergence, not only in syntactic or semantic structure but also in
appraisal type. Discussion of the movie plot and of its characters, whether good or evil,
is a very frequent phenomenon. This discussion results in a number of appraisal targets,
which can in turn lead to the selection of the wrong linkage. On the other hand, all
positive or negative comments about the parts of an electronic appliance are indirectly
related to the same target.
Moreover, the classification accuracy also depends upon the orientation of the review.
The results also show that negative reviews are more prone to misclassification than
positive ones.
6.4.3. Effect of Negation
On the basis of the above discussion it is clear that negation markers affect the results
of the analyzer to a great extent; therefore, we carry out experimentation to analyze the
behavior of negation. For this purpose, we divide the dataset into three different sets of
data. During the test-bed normalization process, we remove the neutral comments from
all three sets.
Test data set | Precision | Recall | F-measure
Set 1 | 0.864 | 0.837 | 0.850
Set 2 | 0.590 | 0.779 | 0.677
Set 3 | 0.510 | 0.615 | 0.558
Table 6.10 Effect of negation in terms of P, R and F (Syed et al. 2011).
Set 1: In Set 1, we include the sentences in which both implicit and explicit negation
are absent. The polarity of these sentences depends only on the subjective terms and other
polarity shifters.
Set 2: Set 2 contains those sentences in which only explicit negation particles are
used and implicit negation is absent.
Set 3: To compile Set 3, we add sentences with implicit negation to Set 2. In this set,
both implicit and explicit negation are present in addition to polar terms.
Table 6.10 gives the results from the three sets of data in terms of precision, recall,
and F-measure. From these values, the total performance accuracy is about 77%. Set 1,
in which only polar terms are present, gives the best classification results.
The results from Set 2 are lower, as it contains only the
sentences with negation particles. From this result, it is inferred that negation
particles can cause a relatively high rate of misclassification. Still, the average accuracy
over Set 1 and Set 2 is quite satisfactory. The results from Set 3 show that implicit
negation still needs improved treatment.
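As a quick check on how the three columns of Table 6.10 relate, the sketch below recomputes the F-measure from the reported precision and recall. It assumes the balanced F-measure (the harmonic mean of P and R); the text does not state which variant was used, and the function and variable names are illustrative only.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall values copied from Table 6.10.
table_6_10 = {
    "Set 1": (0.864, 0.837),
    "Set 2": (0.590, 0.779),
    "Set 3": (0.510, 0.615),
}

# Recompute F for every row, rounded to three decimals like the table.
recomputed = {
    name: round(f_measure(p, r), 3) for name, (p, r) in table_6_10.items()
}

for name, f in recomputed.items():
    print(name, f)
```

Under this assumption, the Set 1 and Set 3 rows reproduce the tabulated 0.850 and 0.558, while the Set 2 row recomputes to about 0.671, slightly below the tabulated 0.677.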
Observations:
Apart from the results, we have the following noteworthy observations about
negation:
• On average, two to three negation particles appear in a single review, and the use of
negation is author dependent; some authors tend to use more negation particles than
others. “‫( ”ﻧﮩﯿﮟ‬naheen, not) is the most used particle. In comparative sentences, the
negation particle “‫( ”ﻧﺎ‬na, no) is used with multiple targets of the appraisal.
• Sentential negation is rarely misclassified compared to constituent negation.
• Morphological negation is handled automatically, because most of the words
inflected by the lexical negation markers,
e.g., “‫( ”ﺑﮯ‬bay), “‫( ”ﺑﺎ‬ba), etc., are already present in the lexicon and are annotated
with their respective polarities;
e.g., “‫( ”ﺑﮯ ﻓﺎﯾﺪه‬bayfayeeda, useless) is a lexical entry with a negative polarity.
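The difference between morphological and explicit negation noted above can be illustrated with a minimal sketch. It is not the analyzer's implementation: the lexicon fragment, the romanized tokens "mufeed" and "shandaar", and the flip-once-per-particle rule are assumptions made for illustration. Only "bayfayeeda", "naheen" and "na" are taken from the text.

```python
from typing import Dict, List

# Hypothetical lexicon fragment; polarities are +1 (positive) or -1 (negative).
# "bayfayeeda" already contains the negative prefix "bay", so its polarity is
# stored directly and no prefix analysis is needed at run time.
LEXICON: Dict[str, int] = {
    "bayfayeeda": -1,  # useless (entry mentioned in the text)
    "mufeed": +1,      # useful (illustrative entry)
    "shandaar": +1,    # wonderful (illustrative entry)
}

# Explicit negation particles named in the text.
NEGATION_PARTICLES = {"naheen", "na"}


def phrase_polarity(tokens: List[str]) -> int:
    """Sum lexicon polarities, then flip the sign once per negation particle."""
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    flips = sum(1 for tok in tokens if tok in NEGATION_PARTICLES)
    return -score if flips % 2 else score


print(phrase_polarity(["bayfayeeda"]))        # morphological negation
print(phrase_polarity(["mufeed", "naheen"]))  # explicit negation particle
```

In this sketch the morphologically negated word needs no special treatment, whereas the explicit particle must be detected and applied to the polar term, which is where misclassification can arise.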
6.5. Example illustrations
Example 1: Positive Review
Consider the following review about a laptop:
‫ اس ﮐﯽ ﭘروﺳﯾﺳﻧﮓ ﮐﯽ‬.‫ ﯾہ اﯾﮏ ﺷﺎﻧدار ﭼﯾز ﮨﮯ‬.‫ﭘﭼﮭﻠﮯ ﻣﮩﯾﻧﮯ ﻣﯾں ﻧﮯ اﯾﮏ ﻟﯾپ ﭨﺎپ ﺧرﯾدا ﮨﮯ‬
.‫ ﺟو ﻣﯾرے ﻟﺋﮯ ﻣﻔﯾد ﮨﮯ‬،‫ اﮔرﭼہ ﺑﯾﭨری دﯾرﭘﺎ ﻧﮩﯾں ﮨﮯ‬.‫ آﭘرﯾﭨﻧﮓ ﺳﺳﭨم ﺑﮩﺗرﯾن ﮨﮯ‬.‫رﻓﺗﺎر ﺑﮩت ﺣﯾرت اﻧﮕﯾز ﮨﮯ‬
Translation: Last month I bought a laptop. It is a wonderful thing. Its processing speed is
very amazing. The operating system is the best. Although its battery is not long-lasting,
it is good for me.
Figure 6.3 Example of a positive review.
Result:
The result of the analysis in Figure 6.4 shows that this is a positive review.
Figure 6.4 Result of the analysis.
Example 2: Negative Review

Now, consider another review related to a movie:
‫ﺑﮩت ﻋرﺻﮯ ﮐﮯ ﺑﻌد اﯾﮏ ﻓﻠم دﯾﮑﮭﻧﮯ ﮐو ﻣﻠﯽ ﮨﮯ۔ ﻓﻠم ﮐﺎ ﭨﺎﭘﮏ ﭘراﻧﺎ ﮨﮯ۔ ﯾہ ﻓﻠم اﻧﺗﮩﺎي ﻓﺎﻟﺗو اور ﻧﺎﻗﺎﺑل ﺑرداﺷت ﮨﮯ۔ ﻓﻠم‬
‫ﻣﯾں ﮨﯾرو ﮐﯽ اﯾﮑﭨﻧﮓ ﺑﮩﺗرﯾن ﮨﮯ۔ ﻓﻠم دﯾﮑﮭ ﮐر زﯾﺎده ﻣزه ﻧﮩﯾں آﯾﺎ۔‬
Translation: After a long time, I got to watch a film. The film’s topic is old. This film is
extremely rubbish and intolerable. The hero’s acting in the film is the best. The film was
not much fun to watch.
Figure 6.5 Example of a negative review.
Result:
Figure 6.6 Result of the analysis.
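To make the two examples concrete, the following sketch aggregates hand-annotated SentiUnits into a review-level label. It is an assumption-laden illustration rather than the analyzer described in Chapter 5: the SentiUnit fields, the intensifier weight of 1.5, the simple summation rule, and the English glosses standing in for the Urdu phrases are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentiUnit:
    core: str               # English gloss of the core adjective
    polarity: int           # +1 or -1, as a lexicon would supply it
    intensity: float = 1.0  # above 1.0 when an intensifier such as "very" is present
    negated: bool = False   # True when a negation particle belongs to the unit

    def score(self) -> float:
        value = self.polarity * self.intensity
        return -value if self.negated else value


def classify_review(units: List[SentiUnit]) -> str:
    """Label a review by the sign of the summed SentiUnit scores."""
    total = sum(u.score() for u in units)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Example 1 (laptop), annotated by hand from the translation.
laptop_review = [
    SentiUnit("wonderful", +1),
    SentiUnit("amazing", +1, intensity=1.5),
    SentiUnit("best", +1),
    SentiUnit("long lasting", +1, negated=True),
    SentiUnit("good", +1),
]

# Example 2 (movie), annotated by hand from the translation.
movie_review = [
    SentiUnit("old", -1),
    SentiUnit("rubbish", -1, intensity=1.5),
    SentiUnit("intolerable", -1),
    SentiUnit("best", +1),
    SentiUnit("fun", +1, negated=True),
]

print(classify_review(laptop_review))
print(classify_review(movie_review))
```

With these hypothetical annotations the laptop review sums to a positive score and the movie review to a negative one, in line with the labels of the two examples.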
CHAPTER 7
CONCLUSIONS AND FUTURE DIRECTIONS
This dissertation has investigated the automatic sentiment analysis of a morphologically
rich and resource-poor language: Urdu. The grammatically motivated approach is
well suited to handling the complex morphology and variable vocabulary of the target
language. We have applied the core natural language processing tasks of word
segmentation, POS tagging and phrase chunking. Our system has evolved from a
simple phrase chunking model presented in (Syed et al., 2010) to a more flexible and
mature approach given in (Syed et al., 2012). The results from both versions are
presented and compared in Chapter 6.
Conclusions from linguistic aspects

As the first effort towards sentiment analysis of the Urdu language, we have reached
several conclusions regarding the characteristic features of this language and the
challenges it poses for automatic processing. For example, Urdu is context sensitive,
and hence its word segmentation is itself a major issue. Because of this feature, word
boundary identification is not as straightforward as it is for English.
Another considerable problem is the complex morphology of Urdu text, which results
in an intricate lexicon structure. Our lexicon is based on adjectives, their modifiers,
polarity shifters and negations. The extended version of the lexicon also contains nouns.
We have considered adjectives of all the types discussed in Chapter 4 to make our
system more inclusive. The compilation of the lexicon suggests that handling Urdu
nouns is much more complicated than handling adjectives because of the separate case
markers, which are used as possession markers. Our algorithm handles these possession
markers as separate lexical units.
Urdu adjectival phrases are morphologically complex. In Section 4.1, we have discussed
both marked and unmarked adjectives, which are borrowed from many languages, such as
Persian, Arabic, Hindi, Sanskrit, and English. This diversity results in flexibility and
variety in the morphological and grammatical rules. For example, adjectives that
are Persian loans follow Persian grammar and usually remain unmarked, whereas
Sanskrit-based adjectives show inflections for gender, number, etc.
Almost all types of adjectives (descriptive, attributive, predicative, demonstrative, etc.)
show agreement in case, gender and number with the noun they qualify.
Similarly, some other linguistic phenomena are characteristic of the Urdu language, e.g.,
frequent reduplication (partial as well as full), compounding, and frequent inflections and
derivations.
Conclusions from technical aspects

A sentiment-annotated lexicon turns out to be more intricate than other NLP
lexicons. There are two reasons for this intricacy:
• Each lexicon entry carries polarity information in addition to its
orthographic, phonological, syntactic, and morphological features. This polarity
information is usually represented as positive, negative, or neutral. For
example, SentiWordNet (Andreevskaia and Bergler, 2006) uses triplets [positive,
negative, objective], with a minimum value of 0.0 and a maximum of 1.0.
• Most words exhibit multiple orientations depending upon their use and domain.
For example, in the sentence “This damage is everlasting”, everlasting is a
positive word, but the comment’s overall orientation is negative. Also, unpredictable
is a positive word when used about a movie’s plot, and becomes negative for the
performance of a microwave oven.
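The two properties above suggest the shape of a sentiment-annotated lexicon entry. The sketch below is one possible representation, not the structure used in our lexicon: the class name, the field names, the domain labels and the numeric scores are hypothetical and chosen only to mirror the triplet and the domain-dependent orientation described in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

LABELS = ("positive", "negative", "neutral")


@dataclass
class LexiconEntry:
    word: str
    # (positive, negative, objective) scores, each between 0.0 and 1.0.
    scores: Tuple[float, float, float]
    # Optional per-domain overrides of the orientation label.
    domain_orientation: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if not all(0.0 <= s <= 1.0 for s in self.scores):
            raise ValueError("each score must lie between 0.0 and 1.0")

    def orientation(self, domain: Optional[str] = None) -> str:
        """Return the domain-specific label if one exists, else the top-scoring one."""
        if domain in self.domain_orientation:
            return self.domain_orientation[domain]
        return LABELS[self.scores.index(max(self.scores))]


# Illustrative entries; the numeric scores are made up for the example.
everlasting = LexiconEntry("everlasting", (0.75, 0.0, 0.25))
unpredictable = LexiconEntry(
    "unpredictable",
    (0.4, 0.4, 0.2),
    domain_orientation={"movie": "positive", "appliance": "negative"},
)

print(everlasting.orientation())
print(unpredictable.orientation("movie"))
print(unpredictable.orientation("appliance"))
```

Keeping the domain override separate from the base scores lets the same entry serve several test beds without duplicating the word.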
Moreover, the above-mentioned linguistic aspects of the Urdu language result in much more
complex lexicons. There is a much higher out-of-vocabulary rate than for
languages with well-defined grammars. This also results in poor or unreliable language model
probability estimation, because many combinations of word forms are missing from or
rare in the language model training data.
It is observed that the domain of the test bed affects the classification accuracy; the
results for one domain differ from those for another. Moreover, the orientation of the text
to be analysed affects the accuracy to a great extent. Negative reviews are more prone
to be misclassified than positive ones.
For this reason, our approach handles phrase-level negation as part of the SentiUnits,
which contain adjectives as the core terms and include the negation particles as their
logical constituents. Hence, the total effect of the negation is dealt with along with the effect
of the subjective words. This approach is well suited to handling the free word order
of the Urdu language. It also handles the variant grammatical structures of
Urdu sentences successfully, as indicated by the experimental results, with an
overall accuracy of 77%.
Although the shallow-parsing-based approach is appropriate for handling simple
opinions, it results in misclassifications when applied to complex sentences with
multiple targets. Therefore, the approach presented in Model B, which uses dependency
parsing after shallow parsing, is much more reliable.
Directions for future endeavors

The classification accuracy is highly domain specific: the results for the
domain of electronic appliances are more accurate than those for movies. The
problem of domain independence is still an open issue in the sentiment analysis research
community, even for the English language. Therefore, our primary future work is to increase
the system's knowledge of the Urdu language by including more adjectives and other parts of
speech. The lexicon can be extended on the same model by introducing new rules
for handling adjectives and adjectival phrases, adverbs and adverbial phrases, verbs and
verb phrases, etc. We intend to extend the lexicon to such an extent that our
model becomes domain independent.
Another future direction is to update our model to handle the Hindi language, which is
morphologically similar to Urdu but orthographically different. Because Hindi lacks the
segmentation and diacritic issues, we believe that our updated model can perform well for
Hindi as well. This model can also be applied to some other morphologically rich
languages, such as Punjabi, Persian, and Sindhi, which have the same orthography and very
similar grammar rules.
Most of the research on the English language relies only on the extraction of
adjectives or adjectival phrases. Very few contributions have
considered adverbs or adverbial phrases. In future, we intend to extend our model by
adding adverbial phrases in combination with adjectival phrases to handle more
diversified opinions. In this way both aspects, i.e., the functions and attributes of the target
product, can be handled. The main strength of this model is its flexibility. As we have
considered the classification at the phrase level, we can add new rules and new phrases
very easily to the core model without making major alterations to the algorithm.
REFERENCES
1. Abbasi A, Chen H, Salem A (2008) Sentiment analysis in multiple languages: feature
selection for opinion classification in web forums. ACM Trans Inf Syst, pp 1–34
2. Abdul-Mageed M, Korayem M (2010) Automatic identification of subjectivity in
morphologically rich languages: case of Arabic. In: Proceedings of the 1st workshop on
computational approaches to subjectivity & sentiment analysis (WASSA), Lisbon pp 2–6
3. Ahmed T, Hautli A (2010) An Experiment for a basic lexical resource for Urdu on the
basis of Hindi WordNet. In: Proceedings of CLT 2010.
4. Akram Q, Naseer A, Hussain S (2009) Assas-Band, an Affix-Exception-List based Urdu
stemmer. In: Proceedings of 7th workshop on Asian Language Resources, pp 40-47.
5. Ali W, Hussain S (2010) A hybrid approach to Urdu verb phrase chunking. In:
Proceedings of 8th workshop on Asian Language Resources, pp 137-143.
6. Andreevskaia A, Bergler S (2006) Mining WordNet for fuzzy sentiment: sentiment tag
extraction from WordNet glosses. In: Proceedings of the 11th conference of the European
chapter of the association for computational linguistics, EACL-2006, Trento, pp 209–216
7. Annet M, Kondrak G (2008) A comparison of sentiment analysis techniques: polarizing
movie blogs. In: Proceedings of Canadian AI, pp 25–35
8. Anwar W, Wang X, Wang XL (2006) A survey of automatic Urdu language processing.
In: Proceedings of 5th international conference on Machine Learning and Cybernetics.
9. Baker P, Hardie A, McEnery T, Jayaram BD (2003) Corpus data for South Asian
language processing. In: Proceedings of the EACL workshop on South Asian languages,
Budapest
10. Bansal M, Cardie C, Lee L (2008) The power of negative thinking: exploring label
disagreement in the min cut classification framework. In: Proceedings of
COLING, Manchester, pp 13–16
11. Bhattacharyya P (2010) IndoWordNet. In: Proceedings of the Seventh conference on
International Language Resources and Evaluation (LREC’10).
12. Bhattacharyya P, Pande P, Lupu L (2008) Hindi WordNet. Linguistic Data Consortium,
Philadelphia.
13. Bloom K, Argamon S (2010) Unsupervised extraction of appraisal expressions. In:
Farzindar A, Kešelj V (eds) Canadian AI 2010. LNCS (LNAI), vol 6085. Springer,
Heidelberg, pp 290–294
14. Breck E, Choi Y, Cardie C (2007) Identifying expressions of opinion in context. In:
Proceedings of IJCAI’07. Menlo Park, CA, pp 2683–2688
15. Choi Y, Cardie C (2008) Learning with compositional semantics as structural inference
for subsentential sentiment analysis. In: Proceedings of the conference on empirical
methods in natural language processing, Honolulu, HI, pp 793–801
16. Crilley K (2001) Information warfare: new battle fields, terrorists, propaganda, and the
Internet. ASLIB Proc 53(7): 250–264
17. Dalal A, Nagaraj K, Sawant U, Shelke S (2006) Hindi part of speech tagging and
chunking: A maximum entropy approach. In: Proceedings of NLPAI Machine Learning
Context.
18. Dave K, Lawrence S, Pennock DM (2003) Mining the peanut gallery: opinion extraction
and semantic classification of product reviews. In: Proceedings of the twelfth
international world wide web conference (WWW 2003), Budapest, pp 519–528
19. Durrani N, Hussain S (2010) Urdu word segmentation. In: Proceedings of 11th annual
conference of the North American chapter of the association for computational
linguistics, Los Angeles
20. Fellbaum C (ed) (1998) WordNet: an electronic lexical database. MIT Press,
Cambridge
21. Glaser J, Dixit J, Green DP (2002) Studying hate crime with the Internet: What makes
racists advocate racial violence? J Soc Issues 58(1):177–193
22. Hardie A (2003) Developing a tagset for automated part-of-speech tagging in Urdu. In:
Proceedings of the conference of the corpus linguistics, Lancaster
23. Hatzivassiloglou V, McKeown KR (1993) Towards the automatic identification of
adjectival scales: clustering adjectives according to meaning. In: Proceedings of the 31st
Annual Meeting of the ACL, pp 172–182
24. Hatzivassiloglou V, McKeown KR (1997) Predicting the semantic orientation of
adjectives. In: Proceedings of ACL’97. Stroudsburg, PA, pp 174–181
25. Hatzivassiloglou V, Wiebe JM (2000) Effects of adjective orientation and gradability on
sentence subjectivity. In: Proceedings of the 18th international conference on
computational linguistics, New Brunswick, NJ
26. Hautli A, Butt M (2011) Towards a computational semantic analyzer for Urdu. In:
Proceedings of the 9th workshop on Asian Language Resources, pp 71-78.
27. Higashinaka R, Prasad R, Walker MA (2006) Learning to generate naturalistic utterances
using reviews in spoken dialogue systems. In: Proceedings of the 21st international
conference on computational linguistics and 44th annual meeting of the ACL, Sydney, pp
265–272
28. Hu M, Liu B (2004) Mining and summarizing customer reviews. In: Proceedings of
SIGKDD’04, pp 168–177
29. Humayoun M, Hammarström H, Ranta A (2007) Urdu morphology, orthography and
lexicon extraction. In: Proceedings of the 2nd workshop on computational approaches to
Arabic script-based languages. Stanford, USA, pp 59–66
30. Ijaz M, Hussain S (2007) Corpus based Urdu lexicon development. In: Proceedings of the
conference on language technology, University of Peshawar, Pakistan
31. Jang H, Shin H (2010) Language-specific sentiment analysis in morphologically rich
languages. In: Proceedings of the COLING, Poster Volume, Beijing, pp 498–506
32. Jia L, Yu C, Meng W (2009) The effect of negation on sentiment analysis and retrieval
effectiveness. ACM
33. Kaji N, Kitsuregawa M (2007) Building lexicon for sentiment analysis from massive
collection of html documents. In: Proceedings of EMNLP’07, pp 1075–1083
34. Kamps J, Marx M, Mokken RJ, de Rijke M (2004) Using WordNet to measure semantic
orientation of adjectives. In: Proceedings of LREC’04, pp 1115–1118
35. Kennedy A, Inkpen D (2006) Sentiment classification of movie and product reviews
using contextual valence shifters. Computational Intelligence 22(2):110–125
36. Kennedy A, Inkpen D (2005) Sentiment classification of movie reviews using
contextual valence shifters. In: Proceedings of FINEXIN
37. Khan SA, Anwar W, Bajwa UI (2011) Challenges in developing a rule based Urdu
stemmer. In: Proceedings of 2nd workshop on south and southeast Asian Natural
Language Processing, pp 46–51.
38. Kim S-M, Hovy E (2006) Automatic identification of pro and con reasons in online
reviews. In: Proceedings of the COLING, Sydney pp 483–490
39. Kumar A, Siddiqui T (2008) An Unsupervised Hindi Stemmer with Heuristics
Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy
Unstructured Text Data.
40. Lehal GS (2009) A two stage word segmentation system for handling space insertion
problem in Urdu script. In: Proceedings of world academy of science, engineering and
technology, Bangkok pp 321–324
41. Lehal GS (2010) A word segmentation system for handling space omission problem in
Urdu script. In: Proceedings of the 1st workshop on South and Southeast Asian natural
language processing (WSSANLP), the 23rd international conference on computational
linguistics, COLING, Beijing, pp 43–50
42. Moilanen K, Pulman S (2008) The good, the bad, and the unknown. In: Proceedings of
ACL/HLT
43. Muaz A, Ali A, Hussain S (2009) Analysis and development of Urdu POS tagged
corpora. In: Proceedings of the 7th workshop on Asian language resources, ACL-
IJCNLP, Suntec, Singapore, pp 24–31
44. Mukund S, Ghosh D (2011) Using sequence kernels to identify opinion entities in Urdu.
In: Proceedings of the 15th conference on Computational Natural Language Learning, pp
58-67.
45. Mukund S, Ghosh D, Srihari RK (2010) Using cross-lingual projections to generate
semantic role labeled corpus for Urdu—a resource poor language. In: Proceeding of the
23rd international conference on computational linguistics COLING, Beijing pp 797–805
46. Mukund S, Srihari R (2010) An information-extraction system for Urdu: a resource
poor language. ACM Transactions on Asian Language Information Processing 9(4),
Article 15.
47. Mullen T, Collier N (2004) Sentiment analysis using support vector machines with
diverse information sources. In: Proceedings of the conference on empirical methods in
natural language processing, Barcelona, pp 412–418
48. Na J-C, Sui H, Khoo C, Chan S, Zhou Y (2004) Effectiveness of simple linguistic
processing in automatic sentiment classification of product reviews. In: Proceedings of
conference of the international society of knowledge organization (ISKO), pp 49–54
49. Paik JH, Parui SK (2008) A simple stemmer for inflectional languages. Forum for
Information Retrieval Evaluation.
50. Pang B, Lee L (2004) A sentimental education: sentiment analysis using subjectivity
summarization based on minimum cuts. In: Proceedings of the 42nd meeting of the
association for computational linguistics, Barcelona, pp 271–278
51. Pang B, Lee L (2008) Opinion mining and sentiment analysis. Found Trends Inf
Retrieval 2(1–2):1–135
52. Pang B, Lee L, Vaithyanathan S (2002) Thumbs up? Sentiment classification using
machine learning techniques. In: Proceedings of the conference on empirical methods in
NLP, Philadelphia, PA, pp 79–86
53. Pennebaker JW, Mehl MR, Niederhoffer K (2003) Psychological aspects of natural
language use: our words, our selves. Annual Review of Psychology 54:547–577
54. Polanyi L, Zaenen A (2004) Context valence shifters. In: Proceedings of the AAAI Spring
Symposium on Exploring Attitude and Affect in Text
55. Riaz K (2010) Rule based named entity recognition in Urdu. In: Proceedings of the 2010
Named entities Workshop, ACL 2010, pp 126-135.
56. Riloff E, Wiebe J (2003) Learning extraction patterns for subjective expressions. In:
Proceedings of the conference on empirical methods in natural language processing
(EMNLP), Sapporo pp 25–32
57. Riloff E, Wiebe J, Wilson T (2003) Learning subjective nouns using extraction pattern
bootstrapping. In: Proceedings of the 7th conference on natural language learning,
Edmonton, pp 25–32
58. Rizvi SMJ, Hussain M (2005) Modeling case marking systems of Urdu-Hindi languages
by using semantic information. In: Proceedings of natural language processing and
knowledge engineering, pp 85–90
59. Schmidt RL (1999) Urdu: an essential grammar. Routledge Publishing, New York
60. Sharifloo AA, Shamsfard M (2008) A bottom up approach to Persian stemming. In:
Proceedings of the 3rd International Joint Conference on Natural Language Processing.
61. Singh A, Bendre S, Sangal R (2005) HMM based chunker for Hindi. In: Proceedings of
IJCNLP-05: 2nd international joint conference on Natural Language Processing.
62. Snyder B, Barzilay R (2007) Multiple aspect ranking using the Good Grief algorithm. In:
Proceedings of the joint human language technology/North American chapter of the ACL
conference, Rochester, NY pp 300–307
63. Stone PJ, Dunphy DC, Smith MS, Ogilvie DM (1966) The general inquirer: a computer
approach to content analysis. MIT Press, Cambridge
64. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Associating Targets with
SentiUnits: A Step Forward in Sentiment Analysis of Urdu Text. In: Artificial
Intelligence Review.
65. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (a) Sentiment Analysis of Urdu
Language: Handling Phrase-Level Negation. In: Proceedings of the 10th Mexican
international conference of artificial intelligence, pp 382–393
66. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (b) Adjectival Phrases as the
Sentiment Carriers in the Urdu Text. Journal of American Science 7(3):644–652
67. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (c) Sentiment-Annotated
Lexicon Construction for an Urdu Text Based Sentiment Analyzer. In: Pakistan Journal
of Science (2011), ISSN: 0030-9877
68. Syed AZ, Muhammad A, Martínez-Enríquez AM (2010) Lexicon based sentiment
analysis of Urdu text using SentiUnits. In: Proceedings of the 9th Mexican international
conference of artificial intelligence, Pachuca, Mexico, pp 32–43
69. Tan S, Cheng X, Wang Y, Xu H (2009) Adapting Naive Bayes to domain adaptation for
sentiment analysis. In: Proceedings of the 31st European conference on IR research on
advances in information retrieval, pp 337–349
70. Thabet N (2004) Stemming the Qur’an. In: Proceedings of the Workshop on
Computational Approaches to Arabic Script-based Languages.
71. Tsarfaty R, Seddah D, Goldberg Y, Kübler S, Candito M, Foster J, Versley Y, Rehbein I,
Tounsi L (2010) Statistical parsing of morphologically rich languages (SPMRL) what,
how and whither. In: Proceedings of the NAACL HLT 2010 first workshop on statistical
parsing of morphologically-rich languages, Los Angeles, pp 1–12
72. Turney P (2002) Thumbs up or thumbs down? Semantic orientation applied to
unsupervised classification of reviews. In: Proceedings of 40th meeting of the association
for computational linguistics, Philadelphia, PA, pp 417–424
73. Turney P, Littman M (2003) Measuring praise and criticism: inference of semantic
orientation from association. ACM Trans Inf Syst 21(4):315–346
74. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis.
In: Proceedings of ACM SIGIR conference on information and knowledge management
(CIKM 2005), Bremen, pp 625–631
75. Wiebe J, Wilson T, Bruce R, Bell M, Martin M (2004) Learning subjective language.
Comput Linguist 30(3):277–308
76. Wiegand M et al (2010) A survey on the role of negation in sentiment analysis. In:
Proceedings of the Workshop on Negation and Speculation in Natural Language
Processing. Association for Computational Linguistics
77. Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of HLT/EMNLP
78. Yang K, Yu N, Valerio A, Zhang H (2006) WIDIT in TREC 2006 Blog Track. In:
Proceedings of Text REtrieval conference—TREC
79. Yu H, Hatzivassiloglou V (2003) Towards answering opinion questions: separating facts
from opinions and identifying the polarity of opinion sentences. In: Proceedings of
EMNLP’03, pp 129–136