SUPERVISED BY
DR. MUHAMMAD ASLAM
(2013)
Dissertation
Submitted in partial fulfillment of the requirements for the degree of
Doctor of Philosophy in Computer Science
Redefining Urdu Morphology and Grammar for the Development
of an Integrated Sentiment Analysis Framework
By
_______________________________ _______________________________
Chairman, Dean,
Department of Computer Science and Engineering, Faculty of Electrical Engineering,
University of Engineering and Technology, University of Engineering and Technology,
Lahore, Pakistan. Lahore, Pakistan.
This thesis has been evaluated by the following examiners:
External Examiners
a) From Abroad
Internal Examiner
ABSTRACT
The rise of social networking sites and blogs has stimulated a bull market in personal opinion:
consumer recommendations, product reviews, ratings, and other types of online expression. For
computational linguistics researchers, this fast-growing heap of information has opened an
exciting research frontier, referred to as Sentiment Analysis (SA). For English, this area has been
under study for the last decade. But other major languages, like Urdu, have been
overlooked by the research community. Urdu is a morphologically rich and resource-poor
language. Its distinctive features, such as complex morphology, flexible grammar rules,
context-sensitive orthography and free word order, make Urdu language processing a challenging
problem domain. For the same reasons, sentiment analysis approaches and techniques developed
for other well-explored languages are not workable for Urdu text.
This dissertation presents a grammatically motivated sentiment classification framework to
handle these distinctive features of the Urdu language. The main research contributions are: to
highlight the linguistic (orthography, grammar, morphology, etc.) as well as technical
(parsing algorithm, lexicon, corpus, etc.) aspects of this multidimensional research problem; to
explore Urdu morphological operations, grammar and orthographic rules; and to redefine these
operations and rules with respect to the requirements of a sentiment analysis framework. The
orthographical, morphological, grammatical and finally the conceptual details of the language
are our target concerns. Additionally, our approach can help in the sentiment analysis of other
languages, like Arabic, Persian, Hindi, Punjabi, etc.
The proposed framework emphasizes the identification of SentiUnits rather than the
subjective words in the given text. SentiUnits are the sentiment-carrying expressions, which reveal
the inherent sentiments of the sentence for a specific target. The targets are the noun phrases about
which an opinion is expressed. The system extracts SentiUnits and the target expressions through
shallow-parsing-based chunking. The dependency parsing algorithm creates associations between
these extracted expressions. The framework uses a sentiment-annotated, lexicon-based
approach. Each entry of the lexicon is marked with its orientation (positive or negative) and its
intensity (force of orientation) score. The experimental evaluation of the system, with a
sentiment-annotated lexicon of Urdu words and two corpuses of reviews as test beds, shows
encouraging results in terms of accuracy, precision, recall and f-measure.
ACKNOWLEDGEMENTS
I believe the research work presented in this dissertation, from conception to completion,
is a blessing from my Allah, who answered my parents’ prayers and blessed me with
the strength. I also want to express my deepest gratitude to several individuals:
First and foremost, my utmost gratitude to Dr. Muhammad Aslam, my supervisor,
whose support and encouragement, I will never forget.
Dr. Ana Maria Martinez-Enriquez, who guided me in writing good research papers
through her thoughtful comments and suggestions.
Dr. Muhammad Ali Maud, Chairman of the Department of Computer Science and
Engineering, for his kind concern and consideration regarding my academic
requirements.
My respectable teachers during the PhD course work for their guidance and
invaluable intellect.
I am grateful to my colleagues and staff in the Computer Science and Engineering
Department.
Mr. Waqaar, who assisted me in the implementation and testing phase.
Lastly, I would like to thank my family for all their love and encouragement. My
parents, for being the excellent models of success and brilliance, who raised me with
a love of science and supported me in all my pursuits. My loving, supportive,
encouraging, and patient husband, Hasan, whose sincere support during all stages of
this Ph.D. gave me the feeling that I always had him on my side. Most of all, my
children Irtaza, Fatima and Ibrahim for their patience and tolerating my long study
hours.
Dedicated to my Parents, Husband and Children
TABLE OF CONTENTS
CHAPTER 1: INTRODUCTION
1.1. Research Motivation
1.2. Research contribution
1.3. The Problem of Sentiment Analysis
1.3.1. Targets of the appraisal
1.3.2. Sources of the appraisal
1.3.3. Appraisal expressions
1.3.4. Orientation
1.4. Sentiment annotated lexicon
1.5. Problem statement
1.6. System Evolution
1.7. Dissertation Outline
REFERENCES
CHAPTER 1
INTRODUCTION
The information in the world can be generally categorized into two key types: facts and
opinions.
The facts are objective expressions describing events, entities and their characteristic
properties.
The opinions are typically subjective expressions or appraisal expressions that
describe personal or individual sentiments or appraisals about events, entities and
their characteristic properties.
The notion of opinion is very broad. For this dissertation, our focus is on the opinions
generated by individuals, particularly the appraisal expressions which are given in the
form of web reviews.
These appraisal expressions, in the form of opinions and subjective texts, are very
important. As Aristotle rightly observed, "Man is by nature a social
animal". It means that people always seek suggestions, opinions, and views from other
people in society for their survival and for sound decisions in every walk of life.
There are many areas of textual natural language processing, such as information extraction,
summarization, information retrieval, text clustering, web search and text categorization
or classification. Until recently, little research had been done on the analysis of subjective texts for their
inherent sentiments. One of the main reasons for the lack of study on
automatic analysis of subjective text is the fact that there was little opinionated text
available before the World Wide Web. People used to seek opinions from their friends
and relatives before making any decision. Likewise, organizations used to conduct polls,
surveys and focus groups whenever they wanted to find the opinions or sentiments of
their clients. But in this modern era of computers and technology, we live in virtual
communities and societies. Now, internet forums, blogs, consumer reports, product
______________________________________________________________________________________
Redefining Urdu Morphology & Grammar for the Development of an Integrated Sentiment Analysis
Framework.
Chapter 1| Introduction 2
reviews, and other types of discussion groups have opened new horizons for the human mind.
That is why, from casting a vote to buying the latest gadget, people search for opinions and
reviews from other people on the internet. They can now give their reviews about
products at business sites and convey their views on almost anything in web forums,
blogs and discussion groups, which are collectively called user-generated content.
This is true not only for individuals but also for organizations and companies. For an
organization or company, it may no longer be necessary to organize focus groups,
conduct surveys, or employ external consultants in order to get client or consumer
opinions regarding its products and those of its competitors. Now, the user-generated
content on the Web can easily provide such information.
However, finding the right sources of subjective text and monitoring them on the
Web is still a difficult task, because there is a very large number of diverse sources, and
each source may itself hold a massive volume of such text. In many cases, the opinions
are hidden in lengthy forum posts and blogs. It is hard and very time-consuming for a
human reader to find relevant sources, extract the related sentences, read them, analyze
them, and classify them into a usable form. Thus, automated opinion or sentiment
discovery and summarization systems are needed. This need has given rise to an exciting
and relatively new area of text analysis, which is referred to by many names, such as sentiment analysis,
opinion mining, subjectivity analysis, and appraisal extraction (Pang and Lee, 2008). In
this dissertation we use the term sentiment analysis.
On the one hand, people use internet forums, blogs, consumer reports, product reviews,
and different types of discussion groups for making everyday decisions. This text helps
them in almost all aspects of life, from medical care to business proposals and from home
education to professional training.
On the other hand, the negative side of this opinion sharing, in the form of revolutionary
or extremist propaganda, cannot be ignored. According to (Glaser et al. 2002),
extremist groups use the Internet to promote hatred and aggression. The Internet has
turned into a ubiquitous, anonymous, economical, and rapid means of communication for
such groups (Crilley, 2001). Now, people discuss every type of emotional
behavior in web discussions and openly post their opinions. But this information can
mislead the general public in their beliefs and thoughts; children and youth
are particularly vulnerable.
Therefore, the analysis of user-generated web content is not only useful for commercial
purposes; it is needed even more urgently to discourage such misinformation,
particularly in the main languages of the world.
Consequently, research on opinion mining and sentiment analysis for some Indo-European
languages, like English, is flourishing and has a number of successful
contributions: (Turney 2002), (Pang et al. 2002), (Riloff et al. 2003), (Riloff and Wiebe
2003), (Tan et al. 2009) and (Bloom and Argamon 2010). These works use multiple
approaches and techniques to handle this area more effectively, and most of
them perform the task of sentiment analysis very successfully. There
are now at least 20-30 companies that offer sentiment analysis services in the USA alone
(Liu, 2010).
Factor II Urdu as a Morphologically Rich Language: Although sentiment
analysis is a well-explored field for the English language, it is not yet settled whether
and how equivalent success could be attained for Morphologically Rich Languages
(MRLs) (Abdul-Mageed and Korayem 2010). MRLs are defined as languages
in which considerable information about the syntactic units and their relations is
expressed at the word level, i.e., the structures of the words are complex and morphological
operations like inflection and derivation are more frequent (Tsarfaty et al. 2010). Due to
this word-level complexity, MRLs are more challenging for computational
linguistics (CL) applications. It can result in intricate lexicons, complex stemming,
erroneous word segmentation and ambiguity in part-of-speech tagging, etc. Urdu is a
case in point.
Challenges in Urdu Language Processing: Given that Urdu is a major language with
about 100 million speakers, there is great potential in performing sentiment analysis
on Urdu text. Because the Urdu language is morphologically rich, its constituent
words and phrases tend to be more complex, due to recurrent derivations and
inflections. Besides the morphological complexity, variability in grammar rules
and vocabulary is common in Urdu text and is considered acceptable. The main reason
for this phenomenon is that Urdu is influenced by many other languages, e.g., Hindi,
Persian, Arabic, Sanskrit and English, not only in vocabulary but also in morphology
and grammar. The loanwords from a particular language follow their own grammar
rules. Hence, the Urdu language is distinctive in its features and linguistic aspects.
Moreover, it is altogether different from the languages that are well established in the field of
sentiment analysis and other computational linguistics applications. Computational
linguistics researchers require a comprehensive understanding of its linguistic as well as
computational aspects. Certain challenges that the Urdu language poses for
researchers are listed below and are explained in detail in Chapter 3.
Optional use of diacritics causes misleading part-of-speech tagging.
Cursive script results in wrong word boundary identification.
Frequent inflection and derivation result in complex stemming.
Word-level complexity makes lexicons more complex.
Flexibility in vocabulary and grammar makes it difficult to define spelling and grammar rules.
The free word order property causes misidentification of part-of-speech tags.
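As a small illustration of the first challenge, the following sketch strips optional diacritics so that diacritized and undiacritized spellings map to the same lookup key. The function name strip_diacritics and the choice to drop every Unicode nonspacing mark (category Mn) are assumptions made for illustration only; they are not taken from the framework described in this dissertation.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Drop Unicode nonspacing marks (category 'Mn'), which include the
    optional Arabic-script short-vowel signs (zabar, zer, pesh)."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

# Two spellings that differ only in an optional short-vowel mark.
word_with_zer = "\u0627\u0650\u0633"    # alef + zer (kasra) + seen
word_with_pesh = "\u0627\u064F\u0633"   # alef + pesh (damma) + seen
bare_form = "\u0627\u0633"              # alef + seen, as usually written

# Both diacritized spellings collapse to the same bare form, which is why
# undiacritized text can be ambiguous for a tagger.
normalized = {strip_diacritics(word_with_zer), strip_diacritics(word_with_pesh)}
```

Here normalized contains only the bare form, so the distinction carried by the diacritics is lost once they are omitted.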
The given text conveys a consumer’s (I) opinion about a product (laptop).
The consumer is the source of the appraisal and the product is its main target. But when we look
at individual sentences, the targets of appraisal are different features of the main
target, and even the sources differ. There are quite a few opinions in this review.
Sentences (2), (3) and (4) express positive orientations of the inherent sentiments, while
sentences (5), (6) and (7) express negative emotions.
All the appraisals have targets that mainly address the central or main target, i.e.,
the laptop. This main target is addressed indirectly through its features; in sentences
(3), (4), and (5) the target features are “processing speed”, “operating system” and
“battery life”, respectively. The expression in sentence (7) concerns the cost of the laptop, but
the opinion/emotion in sentence (6) is about the consumer “me”, not the product. This is a
key point. In a review, the writer may be interested in opinions on various targets, but not
on all (e.g., unlikely on “me”). The source of the appraisals in sentences (2), (3),
(4) and (5) is the consumer himself, but in sentences (6) and (7) it is “mother”. Table 1.1
summarizes this discussion:
3. For a given review, only the appraisal expressions generated by the main source are
considered.
Appraisal Expressions as SentiUnits: In our approach (presented in the following chapters), we
label the appraisal expressions as SentiUnits. To extract the SentiUnits, the
algorithm first identifies the subjective words according to their orientation scores
(positive or negative). Then it attaches the polarity shifters (words that shift the
polarity or orientation of the inherent sentiment; for more detail see Chapter 4),
conjunctions, postpositions and modifiers to extract the appraisal expressions from the
opinionated sentences. Shallow-parsing-based chunking is applied for the extraction
of the SentiUnits, with adjectives as the head words. The overall polarity of a sentence in
a given review can be determined by computing the polarity of these expressions. These
concepts are explained in detail in Chapters 4 and 5.
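The extraction steps described above can be sketched roughly as follows. This is a simplified illustration on English tokens, not the dissertation's Urdu implementation: the toy LEXICON, MODIFIERS and SHIFTERS tables, the "ADJ" tag, and the scoring rule (a modifier multiplies the score, a shifter flips its sign) are all assumptions, and conjunctions and postpositions are left out.

```python
from dataclasses import dataclass

# Hypothetical toy resources; the real lexicon holds Urdu entries.
LEXICON = {"fine": 0.5, "amazing": 0.8, "long": 0.3, "happy": 0.6}
MODIFIERS = {"really": 1.5, "very": 1.5}   # assumed intensity multipliers
SHIFTERS = {"not", "never"}                 # assumed polarity shifters

@dataclass
class SentiUnit:
    words: list
    score: float

def extract_sentiunits(tagged):
    """tagged: list of (word, pos) pairs. Adjectives found in the lexicon
    act as head words; adjacent modifiers and shifters to their left are
    attached to form the unit."""
    units = []
    for i, (word, pos) in enumerate(tagged):
        if pos != "ADJ" or word not in LEXICON:
            continue
        score, start = LEXICON[word], i
        j = i - 1
        while j >= 0 and (tagged[j][0] in MODIFIERS or tagged[j][0] in SHIFTERS):
            w = tagged[j][0]
            score = score * MODIFIERS[w] if w in MODIFIERS else -score
            start = j
            j -= 1
        units.append(SentiUnit([w for w, _ in tagged[start:i + 1]], score))
    return units

def sentence_polarity(tagged):
    """Sum the unit scores and report the sign as the sentence polarity."""
    total = sum(u.score for u in extract_sentiunits(tagged))
    return "positive" if total > 0 else "negative" if total < 0 else "neutral"

battery = [("battery", "NOUN"), ("life", "NOUN"), ("is", "VERB"),
           ("not", "ADV"), ("long", "ADJ")]
speed = [("speed", "NOUN"), ("is", "VERB"),
         ("really", "ADV"), ("amazing", "ADJ")]
```

With these toy inputs, the first sentence yields the unit "not long" with a negative score and the second yields "really amazing" with a positive score.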
1.3.4. Orientation
The sentiment classification of the review starts from the word level. Each word is
classified as subjective or objective; each subjective word is further identified as positive
or negative. This positivity or negativity of the word or phrase is called its orientation.
Words with positive orientation exhibit positive sentiments or a supportive opinion.
An orientation can have a certain force or strength, called its intensity. For example, the
words “good” and “better” both have positive orientation, but the intensity of the latter
is greater.
Definition 6: The positivity or negativity of the appraisal expression for a specific target
or its feature is called its orientation.
Example: In the given example, the appraisal expressions with positive orientation are “such
a fine”, “really amazing” and “fantastic”, while “not long” and “not happy” exhibit
negative orientation.
incorporates two components: (i) the classification algorithm, which analyzes and
classifies the given opinionated text according to inherent sentiments of the reviewer, and
(ii) the lexicon or lexicons annotated with the prior polarities of the lexical entries
(words/ phrases), usually as positive or negative. These prior polarity annotated lexicons
are also called sentiment-annotated lexicons (Pang and Lee, 2008).
Model: At the highest level, our lexicon model categorizes all the lexical entries into
objective and subjective terms. Objective terms have no orientation or intensity and hence are not
marked with prior polarity scores. Therefore, they have no effect on the
overall classification decision. In contrast, subjective terms are the carriers of
sentiment and are marked with polarity scores. Their occurrence can affect or even
altogether alter the final classification decision. With respect to orientation and polarity,
the subjective terms are further categorized into three types (this model is explained in
detail in Chapter 5):
1. Absolute subjective terms with orientation only.
2. Subjective terms with intensity only.
3. Subjective terms with both values of orientation and intensity.
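One way to picture this model is as a record whose orientation and intensity fields may each be absent. The sketch below is only an illustration under assumed names (LexiconEntry, category) and an assumed encoding (+1 or -1 for orientation, a float for intensity); the English sample entries are hypothetical and do not come from the dissertation's lexicon.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class LexiconEntry:
    term: str
    orientation: Optional[int] = None   # +1 positive, -1 negative, None if absent
    intensity: Optional[float] = None   # force of the orientation, None if absent

    def category(self) -> str:
        """Map the entry to one of the categories of the lexicon model."""
        if self.orientation is None and self.intensity is None:
            return "objective"
        if self.intensity is None:
            return "subjective: orientation only"
        if self.orientation is None:
            return "subjective: intensity only"
        return "subjective: orientation and intensity"

# Hypothetical sample entries for illustration.
entries = [
    LexiconEntry("table"),
    LexiconEntry("good", orientation=+1, intensity=0.5),
    LexiconEntry("very", intensity=1.5),
    LexiconEntry("wrong", orientation=-1),
]
categories = [e.category() for e in entries]
```

Objective entries carry no scores at all, so a classifier built on such records can skip them when computing the overall decision.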
So far we have explored the problem of sentiment analysis through examples
and basic definitions. Here, we define the research problem.
Chapter 2 gives a comprehensive overview of the state of the art research in the field of
sentiment analysis and Urdu language processing. It discusses features, approaches, and
techniques used for the development of the sentiment analyzer at different levels for
different languages.
A complete overview of the Urdu language, which is the main object of this research, is
given in Chapter 3. Because Urdu is entirely different from well-explored
languages like English, we explain its characteristic features, such as orthography,
morphology, syntax and grammar, in more detail to make the following chapters
easier to understand.
Chapter 4 describes the concept of the SentiUnits or the appraisal expression. Some
examples and their description augment the explanation of the structure of the SentiUnits.
The overall system’s implementation, modules and their diagrams are given in Chapter 5.
This chapter also explores the construction, integration and model of the sentiment
annotated lexicon of Urdu words.
Chapter 6 presents experimentation and results. For performance evaluation of the
sentiment analysis systems, the experiments are performed on real corpuses of user
reviews. For this purpose, reviews corpuses are collected and sentiment annotated
lexicons are developed.
Finally, Chapter 7 concludes our research contribution with some discussion points and
indications of the future endeavors.
Chapter review:
In this Chapter we defined some basic terminologies used in the task of sentiment
analysis. Using these terms we formulated our problem statement, stated the objectives,
goals and main contributions of the research.
Chapter 2| State of the Art Research 14
CHAPTER 2
STATE OF THE ART RESEARCH
The field of sentiment analysis is the center of attention for researchers from
information retrieval, data mining, computational linguistics, and many other related
areas. Interest is growing rapidly, and previous efforts have covered a broad
range of tasks, for example, polarity classification (Pang et al. 2002), (Turney 2002),
opinion identification (Pang and Lee 2004), and opinion source assignment (Breck et al.
2007), (Choi and Cardie, 2008). Additionally, these contributions have attempted the
problem at different granularity levels. For instance, the contribution in (Pang et al. 2002)
attempts the sentiment classification task at the document level. (Pang and Lee 2004)
explores sentence-level classification, while (Turney 2002) and (Choi and Cardie, 2008)
emphasize phrases. The literature survey given in this Chapter covers major aspects of
SA research and also gives a detailed overview of the contributions made to the language
processing of morphologically rich languages, with Urdu as a special focus.
To present a precise literature survey for SA and Urdu language processing, we focus
on the following major aspects:
1. Features of the given text
2. Techniques
3. Sentiment annotated lexicon construction
4. Generalization among domains
5. Processing of Morphologically Rich Languages
6. Urdu Language Processing
7. Adjective based SA techniques
8. Term level vs. phrase level polarity
9. Negation Handling in SA
Researchers have focused on a number of features of the given text for achieving better
classification results. These features are encoded into feature vectors for the proper
application of machine learning algorithms (Pang and Lee 2008). Thus, feature selection
is a critical task and can affect the results to a great extent. Syntactic, semantic, linking
based, term based, topic oriented and part of speech based features are frequently used in
the literature. In the following, we discuss four categories: part of speech (POS) based,
term based, syntactic, and topic oriented.
POS-based information, particularly about adjectives, can help a lot in sentiment
analysis. That is why the earliest work in this domain uses adjectives as subjectivity
indicators (Hatzivassiloglou and McKeown 1997). After that, (Hatzivassiloglou and
Wiebe 2000; Mullen and Collier 2004), and (Whitelaw et al. 2005) present their
approaches to handling adjectives using multiple techniques. (Turney 2002) argues that
adverbs are also carriers of sentiment in a sentence and should be considered in
combination with adjectives. The sentences are divided into pre-structured grammatical
patterns, which include adjectives and adverbs as the core words. (Riloff et al. 2003)
attempts a relatively new idea and proposes the analysis of nouns in the text. It
emphasizes the concept of subjective nouns and computes the orientation of the
phrases in the sentence that contain them.
Many works consider term-based features. For example, the
position of the term in a sentence is put forward as a feature by (Kim and Hovy 2006).
This work locates specific terms and then, according to their position, computes
subjectivity orientation. Another work (Wiebe et al. 2004) applies the concept of hapax
legomena, i.e., words occurring only once in a given corpus, for feature selection.
It proposes that words that appear only once in the corpus are more subjective
than others. In addition to this feature, it uses a relatively complex syntactic feature,
i.e., collocations of the words in a sentence. If some words or terms co-occur more
frequently than usual, they are considered collocations. According to (Yang et
al. 2006), terms which are rare and are not entered in a pre-existing dictionary tend to be
more subjective, because reviewers use them to emphasize their opinions.
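The hapax legomena feature mentioned above is easy to compute. The following is a minimal sketch on an invented English token list; the function name hapax_legomena and the sample corpus are assumptions for illustration, and real use would run over a full tokenized corpus.

```python
from collections import Counter

def hapax_legomena(tokens):
    """Return the set of tokens that occur exactly once in the corpus."""
    counts = Counter(tokens)
    return {tok for tok, n in counts.items() if n == 1}

# Invented toy corpus: "the" and "is" occur twice, the rest once.
corpus = "the phone is good the screen is gorgeous".split()
rare_terms = hapax_legomena(corpus)
```

In this toy corpus the function words repeat, while the content words each appear once and are therefore returned as candidate subjective terms.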
Table 2.1
Features used and their respective contributions.
(Pang et al. 2002) reports better performance using “presence of term” as a binary-valued
feature vector, whose entries merely specify whether a term occurs (1) or not (0). In a
term frequency feature vector, by contrast, entry values increase with the occurrence frequency of the
corresponding term (Abdul-Mageed and Korayem 2010). Bigrams and trigrams are used
by (Dave et al. 2003). (Kennedy and Inkpen 2006) and (Snyder and Barzilay 2007)
consider contrastive distance between terms as an automatically computed feature.
(Whitelaw et al. 2005) uses the concept of appraisal theory and extracts appraisal
expressions with the help of a sentiment lexicon. (Mullen and Collier 2004) observes that
the sentences which contain a reference to the topic can be considered more important.
For this purpose, it specifies words and word phrases which can be extracted as
indicators of the reference. The features discussed above and the related contributions, with
some further examples, are summarized in Table 2.1.
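The difference between the term presence and term frequency features, and the construction of bigrams, can be illustrated as follows. This is a generic sketch with invented function names, tokens and vocabulary; it is not code from any of the cited works.

```python
from collections import Counter

def presence_vector(tokens, vocabulary):
    """Binary feature vector: 1 if the term occurs at all, else 0."""
    present = set(tokens)
    return [1 if term in present else 0 for term in vocabulary]

def frequency_vector(tokens, vocabulary):
    """Count feature vector: number of occurrences of each term."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

def ngrams(tokens, n):
    """All contiguous n-token sequences (n=2 bigrams, n=3 trigrams)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "good good plot bad acting".split()
vocabulary = ["good", "bad", "boring"]

presence = presence_vector(tokens, vocabulary)
frequency = frequency_vector(tokens, vocabulary)
bigrams = ngrams(tokens, 2)
```

For the repeated word "good", the presence vector records 1 while the frequency vector records 2; the two encodings agree for every term that occurs at most once.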
2.8. Techniques
There are a number of techniques used for sentiment analysis, e.g., unsupervised
bootstrapping, sentiment lexicons and support vector machines (see Table 2.2). In
the unsupervised bootstrapping approach, a primary or initial classifier is applied to the text to
generate labeled data as the output. After that, a supervised learning algorithm may be
applied to this data. The initial classifier can have various implementation possibilities,
according to the language complexity and depth of the required analysis. An example of
such an initial high-precision classifier, used to learn extraction patterns for subjective terms, is
proposed by (Riloff and Wiebe 2003). (Kaji and Kitsuregawa 2007) uses this method for
the automatic construction of a corpus from HTML documents, in which polarity
labels are assigned to the entries.
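The two stages of this bootstrapping idea can be sketched as below. Everything here is an assumption for illustration: the seed word lists, the abstaining seed classifier, and the second stage, which is a crude word-count voter standing in for a proper supervised learner. It does not reproduce the method of (Riloff and Wiebe 2003) or any other cited work.

```python
from collections import Counter

POSITIVE_SEEDS = {"excellent", "amazing"}   # hypothetical seed words
NEGATIVE_SEEDS = {"terrible", "awful"}

def seed_label(tokens):
    """Stage 1: high-precision initial classifier. It labels a sentence
    only when seeds of exactly one polarity occur, otherwise abstains."""
    pos = any(t in POSITIVE_SEEDS for t in tokens)
    neg = any(t in NEGATIVE_SEEDS for t in tokens)
    if pos and not neg:
        return "positive"
    if neg and not pos:
        return "negative"
    return None

def train(sentences):
    """Stage 2: learn per-label word counts from the auto-labeled data."""
    counts = {"positive": Counter(), "negative": Counter()}
    for tokens in sentences:
        label = seed_label(tokens)
        if label is not None:
            counts[label].update(tokens)
    return counts

def classify(tokens, counts):
    """Label a sentence by which class saw its words more often."""
    pos = sum(counts["positive"][t] for t in tokens)
    neg = sum(counts["negative"][t] for t in tokens)
    return "positive" if pos > neg else "negative" if neg > pos else None

unlabeled = [s.split() for s in (
    "amazing camera and sharp display",
    "terrible battery and slow charging",
    "the box was blue",
)]
model = train(unlabeled)
```

After training, the model can label sentences that contain no seed word at all, such as "sharp display", because those words co-occurred with a seed in the automatically labeled data.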
(Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou 2003; Riloff
et al. 2003) and (Higashinaka et al. 2007) employ a sentiment-annotated lexicon induction
technique. As a first step, an unsupervised approach is applied to generate a
sentiment-annotated lexicon. Then, using this lexicon as a resource, the given text is classified as
positive or negative.
(Hu and Liu 2004) and (Andreevskaia and Bergler 2006) use Princeton WordNet for the
extraction of sentiment tags. There is also a trend in the research community to extend
existing lexicons, e.g., SentiWordNet is an extension of WordNet.
Table 2.2.
Techniques used by different contributions.
As we use the lexicon-based approach for the development of the sentiment
analyzer, we discuss here some contributions from this aspect of the research. Lexicon
construction with adequate coverage is a challenging task. From the definition of grammar
rules to their appropriate implementation, it requires considerable expertise in
the target language as well as in computer algorithms. For the task of sentiment
analysis, the entries of these lexicons are annotated with orientation scores in addition
2006) and (Hu and Liu 2004) utilize WordNet or its extensions for the sentiment analysis.
Moreover, (Hatzivassiloglou and Wiebe 2000; Turney 2002; Yu and Hatzivassiloglou
2003; Riloff et al. 2003) and (Higashinaka et al. 2007) have tried to develop algorithms
and techniques for automatic lexicon construction using unsupervised learning methods.
All these discussed contributions are summarized in Table 2.3.
Most of these efforts use pre-developed linguistic resources like corpuses for the
development and extraction of the required lexicons. But Urdu is a resource-poor language,
and hence the task of lexicon construction becomes more difficult and time-consuming.
To our knowledge, no such lexicon exists for Urdu text. However, there are a few
efforts that have tried to construct corpuses and simple lexicons for other NLP
applications.
Table 2.3.
Lexicon construction research.
The preliminary work is presented for the EMILLE (Enabling Minority Language
Engineering) project in the form of a multi-lingual corpus for the South Asian languages.
A parallel corpus for Hindi, Urdu, English, Bengali, Punjabi and Gujarati languages
contains about 200,000 words (Baker et al. 2003). Their independent corpus of Urdu text
has 1,640,000 words annotated with POS tags (Hardie 2003).
Another effort is presented in (Ijaz and Hussain 2007), who use a corpus to automatically
develop an Urdu lexicon. Their corpus is based on cleaned text from news websites and
contains about 18 million words. (Muaz et al. 2009) give a brief analysis of the
parts of speech of the Urdu language and develop a POS-tagged corpus, whereas another
effort (Mukund et al. 2010) generates a semantic-role-labeled corpus for Urdu text using
cross-lingual projections. (Humanyoun et al. 2007) present the automatic extraction of an
Urdu lexicon from a corpus. Table 2.4 shows
the corpora and lexicons developed for the Urdu language for different applications of NLP.
Table 2.4.
Corpuses and lexicons for Urdu language.
specific (Pang and Lee 2004). (Tan et al. 2009) handle the domain adaptation issue using the
frequently co-occurring entropy (FCE) method. It emphasizes a smooth transition
from a domain d1 to another domain d2 through a set of generic features F that represent
both d1 and d2. The work evaluates the model on six domains and finally concludes that FCE
is not the best option. Another feature related to multiple domains is their complexity level.
Sentiment analysis of product and movie reviews is considered the easiest
in the literature (Pang and Lee 2004), and these reviews serve as a test bed for most
approaches. In contrast, political speeches and discussions are perhaps the most
complex to handle. (Bansal et al. 2008) pinpoint an issue and evaluate whether a
speech is in favor of or in opposition to it.
Morphologically rich languages (MRLs) are a challenging domain for NLP
researchers. Still, there are a number of contributions worth mentioning. For example, a
stemming model for classical Arabic in the Holy Quran is presented by (Thabet 2004). This
work uses a stop-word list and builds a list of words from every surah. Both lists are
compared, and when a word in the created list does not exist in the stop-word list,
the algorithm removes its prefixes. The accuracy of the algorithm is 99.6% for prefix
stemming and 97% for postfix stemming. (Paik and Parui 2008) present a general
analysis of the languages spoken in India, particularly Marathi, Hindi, and Bengali. In
this work, the lexical entries are grouped into similarity classes by matching their
prefixes up to a predefined length. Another stemmer
for the Hindi language is proposed by (Kumar and Siddiqui 2008); it computes n-grams
of the words of a given length. The algorithm treats these n-grams as postfixes
and extracts the possible stems with postfixes. Finally, the combination of postfix and
stem with the maximum probability is picked, with a reported accuracy of 89.9%. A Telugu
stemmer in (Akram et al. 2009) applies statistical techniques and
suggests that this MRL requires deeper linguistic analysis for improved results.
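The stem and postfix selection used by such statistical stemmers can be illustrated with a minimal sketch. It is not the model of (Kumar and Siddiqui 2008): the word list is a toy English example, the product of raw stem and postfix counts stands in for a probability, and the names split_counts and best_split are hypothetical.

```python
from collections import Counter

def split_counts(words, max_suffix_len=3):
    """Count every candidate stem and postfix over all splits of all words."""
    stems, suffixes = Counter(), Counter()
    for word in words:
        for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
            stems[word[:-k]] += 1
            suffixes[word[-k:]] += 1
    return stems, suffixes

def best_split(word, stems, suffixes, max_suffix_len=3):
    """Pick the (stem, postfix) pair with the highest count product.

    The product of counts is an assumed stand-in for the probability
    of a split; unseen words are returned unsplit.
    """
    best, best_score = (word, ""), 0
    for k in range(1, min(max_suffix_len, len(word) - 1) + 1):
        stem, suffix = word[:-k], word[-k:]
        score = stems[stem] * suffixes[suffix]
        if score > best_score:
            best, best_score = (stem, suffix), score
    return best

# Toy vocabulary (assumption, for illustration only).
WORDS = ["walked", "walking", "walks", "talked", "talking", "talks"]
stems, suffixes = split_counts(WORDS)
print(best_split("walking", stems, suffixes))
```

A split wins here only because its stem recurs with other postfixes and its postfix recurs with other stems, which is the intuition behind choosing the most probable combination.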
Orthographically and grammatically, Urdu and Persian have many similarities.
This is because the two languages share a large vocabulary. (Sharifloo and Shamsfard
2008) present a rule-based, bottom-up algorithm for stemming Persian text. The
algorithm first extracts the core substring of each word and compares it with already
defined cores using some grammar rules. The strings are matched by means of
predefined morpheme clusters. Moreover, the accuracy is raised to about 90.1%
by applying an anti-rule procedure.
There are some contributions worth mentioning for handling sentiment analysis in MRLs,
for example, (Abdul-Mageed and Korayem 2010) and (Abbasi et al. 2008) for Arabic,
and (Jang and Shin 2010) for the Chinese language. The work presented in (Abdul-Mageed
and Korayem 2010) addresses sentiment analysis of Arabic text. Its
main focus is on the issues specific to Arabic text in developing a practical analyzer
with acceptable performance. It analyzes news text by automatic classification at the
sentence level, using a support vector machine classifier. Another related work is
(Abbasi et al. 2008), which performs sentiment analysis of Arabic and English web forums
with an emphasis on the propagation of extremist opinion. To handle the characteristics
of the Arabic language, it proposes specific feature extraction components. It develops the
Entropy Weighted Genetic Algorithm (EWGA), a hybridized genetic algorithm that incorporates
the information gain heuristic for selecting features, i.e., stylistic and syntactic features.
This algorithm improves the system performance by selecting better key features.
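The information gain heuristic mentioned above can be illustrated with a short sketch. It shows only the standard entropy-based information gain of a single discrete feature with respect to the class labels; it is not the EWGA algorithm itself, and the function names and toy data are assumptions made for illustration.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(
        (labels.count(c) / n) * math.log2(labels.count(c) / n)
        for c in set(labels)
    )

def information_gain(feature_values, labels):
    """Reduction in label entropy obtained by splitting on one feature."""
    n = len(labels)
    gain = entropy(labels)
    for value in set(feature_values):
        subset = [l for f, l in zip(feature_values, labels) if f == value]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

# Toy data (assumption): four documents, two candidate binary features.
labels = ["pos", "pos", "neg", "neg"]
informative = [1, 1, 0, 0]   # perfectly separates the two classes
irrelevant = [1, 0, 1, 0]    # carries no information about the class
print(information_gain(informative, labels))
print(information_gain(irrelevant, labels))
```

A feature selector of this kind would rank candidate features by such a score and prefer those with higher gain.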
Due to the idiosyncratic linguistic features of the Urdu language and its exclusive set of
morphological and grammatical rules, computer-based processing of Urdu is not a
well-explored dimension. Our own contributions address the subjectivity and sentiment
analysis of Urdu text (Syed et al. 2010; Syed et al. 2011; Syed et al. 2012).
In this section, we present a brief survey of the major NLP contributions for the Urdu
language that are useful for sentiment analysis.
A variety of phrases exist in Urdu including verb phrases, noun phrases, and adjectival
phrases. Identifying these phrases in a sentence is very helpful for various applications in
NLP, like, information retrieval or extraction, parsing, sentiment analysis, machine
translation, and question answering. The procedure which directly tags these phrases is
called phrase chunking or simply chunking. For Urdu phrase chunking, a very prominent
contribution is (Ali and Hussain 2010), which describes the structure of Urdu verb
phrases, and applies a series of experiments to automatically label them. It uses a
manually tagged corpus of 100,000 Urdu words with verb phrase chunk tags. The
reported results of this effort give 98.44% accuracy. It uses a hybrid approach with
extended tag set.
As Urdu and Hindi are very similar morphologically, we discuss two contributions on
phrase chunking of Hindi text (Singh et al. 2005; Dalal et al. 2006). In the former, an
HMM-based chunk tagger is presented for the Hindi language. The chunk tagging is divided
into two subtasks: the identification of the chunk boundaries and then the labeling of the
chunks according to their types. In (Dalal et al. 2006), the Hindi tagger uses a statistical
approach based on maximum entropy. Various features are used simultaneously for the
prediction of the word tags. The proposed feature set is broadly classified into
dictionary, context-based, word, and corpus-based features. A corpus
of more than 35,000 words/phrases is used for testing and training, reporting an accuracy
of 87.4%.
Urdu is rich in both derivational and inflectional morphology. For example, the verbs
inflect to agree with case, number, respect, and gender. The verbs are also inflected for
mood (e.g., imperative, infinitive), tense (e.g., present, past), and habitual aspect.
(Akram et al. 2009) state that the verbs alone in Urdu have sixty inflected variations.
Moreover, the adjectives also show agreement for case, number, and gender. (Syed et al. 2012)
describe this phenomenon in detail. The intense inflectional and derivational behavior of
Urdu makes the stemming of Urdu text quite challenging, because
stemming becomes harder to devise as the character encoding, morphology, and script of
a language become more intricate. For example, Italian has more inflections than English,
so its stemming is more complex.
Arabic is also an MRL, so the stemming task becomes even harder. (Riaz 2010) suggests
that the Arabic and Farsi stemming processes cannot be used for Urdu because, owing to its
inflections, they produce erroneous results. Besides, the dictionary/lexicon-based error
correcting schemes used by other stemmers cannot be applied to Urdu because of the dearth
of machine-readable resources. An Urdu stemmer (Akram et al. 2009) focuses on a rule-based
approach, which removes the prefix and the postfix before adding a letter or letters to
generate the surface form of the stem. Exception lists are created and used to complete
the first two steps of the algorithm. If the lookup is successful, the stripping process
is bypassed. (Riaz 2010) describes the challenges related to Urdu stemming and
proposes a rule-based model with a few rules implemented to simulate the intricacies.
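The general shape of such a rule-based stemmer (exception lookup first, then prefix and postfix stripping, then restoring letters) can be sketched as follows. This is a minimal illustration under stated assumptions, not the algorithm of (Akram et al. 2009): the affix lists, the exception list, and the restoration table are tiny hypothetical examples built from words used elsewhere in this document.

```python
# Hypothetical, deliberately tiny resources; a real stemmer needs far larger lists.
EXCEPTIONS = {
    "الفاظ": "لفظ",   # Arabic broken plural, cannot be handled by stripping
    "بادل": "بادل",   # begins with "با" but is not prefixed; keep unchanged
}
PREFIXES = ("بے", "با")
POSTFIXES = ("وں", "یں", "ے")
RESTORE = {"پود": "پودا"}  # letters added back after stripping (assumed table)

def stem(word):
    """Return an approximate stem of an Urdu word using simple affix rules."""
    # Step 1: a successful exception lookup bypasses all stripping.
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    # Step 2: strip at most one known prefix, keeping at least two letters.
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= 2:
            word = word[len(prefix):]
            break
    # Step 3: strip at most one known postfix, keeping at least two letters.
    for postfix in POSTFIXES:
        if word.endswith(postfix) and len(word) - len(postfix) >= 2:
            word = word[:-len(postfix)]
            break
    # Step 4: add letters back where stripping removed part of the stem.
    return RESTORE.get(word, word)

for w in ("پھولوں", "پودے", "باعزت", "بادل", "الفاظ"):
    print(w, "->", stem(w))
```

The entry for "بادل" shows why the exception list is consulted first: without it, the prefix rule would wrongly reduce the word to "دل".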
For Indo-Aryan languages like Urdu, only a few lexical resources are available
and accessible for research. For example, for the Hindi language a lexical
resource like English WordNet is presented as Hindi Wordnet (Bhattacharyya et al. 2008;
Bhattacharyya 2010). The methodology and architecture of this resource are based on the
English WordNet (Fellbaum 1998). Urdu WordNet (Ahmed and Hautli 2010) is
developed using the same approach. As far as corpus construction is concerned, the
Enabling Minority Language Engineering (EMILLE) project is a considerable attempt. It
works on a multi-lingual corpus for the South Asian languages. An independent
parts-of-speech-tagged corpus for Urdu text is developed with about 1,640,000 words
(Hardie 2003). Another parts-of-speech-tagged corpus is presented by (Muaz and
Hussain 2009). (Humanyoun et al. 2007) present the automatic extraction of an
Urdu lexicon from a corpus. Also, (Ijaz and Hussain 2007)
present the development of an Urdu lexicon from a given corpus. The corpus is based
on cleaned text from Urdu news websites and has nearly 18 million words.
(Hautli and Butt 2011) describe a computational semantic analyzer developed as part of the
Parallel Grammar (ParGram) project; it is based on the syntactic analysis done for the Urdu
grammar component of ParGram. In addition to the semantic construction, some peripheral
lexical resources, like a preliminary Urdu WordNet and a VerbNet, are developed and
integrated with the main model. Such resources help to generate a more comprehensive
representation of lexical knowledge, e.g., hyponyms for words and their thematic roles.
There are some other contributions worth mentioning in Urdu NLP. For example,
(Mukund et al. 2010) employ cross-lingual projections in the PropBank paradigm for the
automatic induction of semantic role annotations for Urdu text. These annotations
are made on the basis of word alignments. An Urdu-English parallel corpus is used by
the projection model to utilize syntactic as well as lexical information. The reported
accuracy of the annotations is 92% on short sentences.
Table 2.5.
Urdu language processing.
(Rizvi and Hussain 2005) describe a computational investigation of different Urdu parts of
speech. Their work is largely theoretical; hence it can be used to define and
implement the rules for many language processing applications.
(Mukund and Ghosh 2011) describe the automatic extraction of opinion holder
words and phrases from Urdu text. This work refers to the opinion holders and
their targets together as the opinion entities. It works in two steps: it generates the
required word sequences related to the opinion entities and then disambiguates these
extracted sequences as the holders or targets of the opinions. Morphological operations
such as inflections are used to correctly identify sequence boundaries for the verbs and
nouns. Another work, on the classification of objective and subjective sentences, is
(Mukund and Srihari 2010), which employs a vector space model.
As already mentioned in Section 2.1, the part-of-speech-based features of a text,
particularly its adjectives, can help a lot in sentiment analysis. Here, we focus on
the adjective-based approaches used by the NLP community. One of the earliest works in
this domain (Hatzivassiloglou & McKeown, 1997) uses adjectives as subjectivity
indicators. It employs a log-linear regression model for identification and validation of
the positive or negative semantic orientation of conjoined adjectives. A clustering
algorithm divides the adjectives into groups with respect to orientation and labels them
as positive or negative. Before that, (Hatzivassiloglou & McKeown, 1993) presented an
approach for the automatic recognition of adjectival scales. This approach groups or
clusters the adjectives carrying the same semantics, but not from the perspective of
sentiment analysis. (Bruce & Wiebe, 2000) recognize subjectivity within the text by manual
tagging. They take a case study of sentence-level categorization and categorize clauses
from the “Wall Street Journal” as objective or subjective. Each clause is given a final
classification on the basis of an agreed decision by four judges.
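The intuition behind using conjoined adjectives can be illustrated with a small sketch: adjectives joined by "and" tend to share an orientation, whereas adjectives joined by "but" tend to have opposite ones. The code below is a deliberately simplified label propagation over such links and is not the log-linear regression and clustering procedure of (Hatzivassiloglou & McKeown, 1997); the seed word, the links, and the function name propagate are hypothetical.

```python
from collections import deque

def propagate(seeds, links):
    """Spread +1/-1 orientation labels from seed adjectives along conjunction links.

    seeds: dict mapping an adjective to +1 (positive) or -1 (negative)
    links: list of (adjective, conjunction, adjective) triples, where the
           conjunction is "and" (same orientation) or "but" (opposite)
    """
    graph = {}
    for a, conj, b in links:
        sign = 1 if conj == "and" else -1
        graph.setdefault(a, []).append((b, sign))
        graph.setdefault(b, []).append((a, sign))

    labels = dict(seeds)
    queue = deque(seeds)
    while queue:
        current = queue.popleft()
        for neighbour, sign in graph.get(current, []):
            if neighbour not in labels:  # keep the first label assigned
                labels[neighbour] = labels[current] * sign
                queue.append(neighbour)
    return labels

# Toy conjunction data (assumption, for illustration only).
LINKS = [
    ("good", "and", "honest"),
    ("honest", "but", "boring"),
    ("boring", "and", "dull"),
]
print(propagate({"good": 1}, LINKS))
```

Unlike this sketch, a statistical model can weigh conflicting evidence from many conjunctions instead of trusting the first link it encounters.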
(Hatzivassiloglou & Wiebe, 2000) analyze two main features of adjectives for
subjectivity prediction, i.e., gradability and semantic orientation. They extract reliability
of gradability values using an automatic extraction method. (Turney, 2002) suggests
that adverbs are also carriers of sentiment in a sentence and should be considered in
combination with adjectives. In this work, the sentences are divided into pre-structured
grammatical patterns, which include adjectives and adverbs as the core words. (Riloff et
al., 2003) emphasize the identification of subjective nouns, which are modified by
adjectives. They compute the orientation of the phrases in the sentence that
contain them. (Riloff & Wiebe, 2003) use an unsupervised learning method for the automatic
extraction and learning of patterns for subjective expressions in the given text.
Table 2.6.
Research contributions related to adjective based sentiment analysis.
(Whitelaw et al., 2005) propose the use of appraisal theory for sentiment analysis. They
work on the extraction of appraisal expressions, which are sentiment-oriented
phrases that contain adjectives as head words. (Bloom & Argamon, 2010)
extend this model and propose an approach for the automatic learning of these appraisal
expressions. Research contributions related to adjective-based sentiment analysis are
shown in Table 2.6.
Table 2.7.
Term-level polarity vs. phrase-level polarity approaches.
Table 2.8.
Negation handling for sentiment analysis.
Chapter review:
This chapter describes the state-of-the-art research in sentiment analysis and Urdu
language processing. The literature survey is divided into the following sections: features
of the given text, techniques, sentiment-annotated lexicon construction, generalization
among domains, processing of morphologically rich languages, Urdu language
processing, adjective-based SA techniques, term-level vs. phrase-level polarity, and
negation handling in SA.
Chapter 3| Distinctive Features of the Urdu Language 34
CHAPTER 3
Before reporting our research contributions, some background issues must
be presented and discussed. First, we describe the Urdu language itself, which is the
main subject of this investigation. Urdu is introduced briefly to provide background for the
discussion in later chapters. As this language is not widely studied, this section
contains more detail than would be necessary if a more widely studied language, such as
English or French, were being examined.
Vernacular Urdu. Moreover, Urdu is the national language of Pakistan and is widely
spoken in India, Afghanistan, Bangladesh, Bahrain, Oman, Saudi Arabia, South Africa,
and the United Kingdom. Some salient features of the Urdu language are given in Table 3.1.
The distinctiveness of a language is recognized by its inherent characteristics, which are
its orthography, vocabulary, parts of speech, grammar and morphology. We present here
a concise overview of these characteristics of the Urdu language:
3.1 Orthography
The orthography of a language specifies a standardized method for using a specific script
or writing system: a set of symbols (letters, graphemes and diacritics) and the rules
for writing these symbols. It covers the relationships between graphemes
and phonemes for generating word spellings, as well as diacritics,
capitalization, hyphenation, word boundaries, punctuation marks and emphasis. The
orthography of the Urdu language is influenced by Arabic and Persian.
The Arabic script employs letters to represent consonants and diacritics to indicate
vowels. In Urdu, both long and short vowels exist. Diacritics are placed on the consonants
to specify the short vowels, whereas the long vowels are indicated by the combined effect
of a consonant with a diacritic and an additional letter. These diacritics are optional and
usually not written, but they exist implicitly and the native speaker understands their
pronunciation. Figure 3.2 shows that a consonant can have two
diacritics and that these can be written above or below the consonant.
بّ  بْ  بً  بٰ  بُ  بَ  بِ
Figure 3.2 Diacritics in Urdu with letter “ب”.
3.1.2. Word order
Generally the basic word order of the Urdu clause is given as subject object verb (SOV).
Variation in this word order is common, particularly the reordering of nominal
constituents, especially for thematic purposes. This is the reason that (Butt, 1995) argues
that Urdu is a free order or a non-configurational language.
3.1.4. Ligatures
Urdu uses the Perso-Arabic script, which is cursive and context-sensitive with respect to the
shapes of the letters. It means that the “حروف” (haroof, alphabets) have multiple
glyphs and shapes and are categorized as joiners and non-joiners. The joiner letters
join together into units called ligatures (Durrani and Hussain 2010). One word can
have either a single ligature or multiple ligatures. During writing, all characters join
together until a non-joiner appears. A new ligature starts after the non-joiner. The process
is repeated until the word ends. If more than one ligature is present in a word, it seems
that the word has a space within it, but no space is actually there. Consider the example of
“جانور” (janwar, animal): this word has three ligatures, which are written without a space,
whereas the word “ہمت” (himat, courage) has only one ligature. Ligatures may also be
separated even in the absence of a non-joiner, for
example, “کبھی کبھی” (kabhi kabhi, sometimes) and “بے جان” (bay jaan, lifeless). This
phenomenon is very common in the compounding and reduplication of words.
An Urdu character exhibits multiple shapes according to its position in the ligature, i.e.,
in the initial, medial, or final position, or it remains unconnected. For example, consider
the letter “ج” (jeem). It can be joined in the initial position as “جا”, in the medial
position as “بجا” and in the final position as “حج”; see Table 3.2.
Due to this context-sensitive orthography and the different behaviors of joiners and
non-joiners, word boundary identification becomes a major task. The space is not
always an indicator of the word boundary.
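The joiner and non-joiner behavior described above can be sketched in a few lines. The sketch assumes a simplified set of non-joiner letters, ignores diacritics and spaces, and uses a hypothetical function name; it illustrates only how a word breaks into ligatures and is not a complete model of Urdu orthography.

```python
# Assumed, simplified set of non-joiners: letters that do not connect
# to the letter that follows them.
NON_JOINERS = set("اآدڈذرڑزژوے")

def ligatures(word):
    """Split a single Urdu word into its ligatures.

    Letters keep joining until a non-joiner appears; the non-joiner closes
    the current ligature and the next letter starts a new one.
    """
    parts, current = [], ""
    for ch in word:
        current += ch
        if ch in NON_JOINERS:
            parts.append(current)
            current = ""
    if current:  # trailing joiners form the last ligature
        parts.append(current)
    return parts

for w in ("جانور", "ہمت", "جگنو", "جاگو"):
    print(w, len(ligatures(w)), ligatures(w))
```

Because a ligature break inside a word looks much like a gap between words, counting ligatures alone cannot locate word boundaries, which is why the space is not a reliable indicator.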
6. Post Positions
7. Numerals
8. Auxiliaries
9. Conjunctions
10. Haroof
11. Case markers
Among these, nine (1-9 in the above list) are similar to the English parts of speech
in their semantics (though their morphological and grammar rules are clearly distinct),
while the “حروف” (haroof) and case markers are different. The “haroof” are words
which have no independent meaning. To become meaningful, they are used with other
words (Schmidt, 2000). Examples are “اے” (ay), “او” (o), “واہ” (wah), and “نا” (na).
3.3. Vocabulary
The absorption power of Urdu is quite exceptional. In addition to Arabic, Persian, and
Turkish influences, Urdu has kept on absorbing vocabulary from English, Sanskrit and
Hindi. This capacity enhances the richness of the language. Table 3.3 gives some
examples of Urdu words taken from English, Persian, Sanskrit, Arabic and Turkish,
along with their use in sentences.
3.4. Morphology
Morphology can be defined as the study of the structure of words. For example, it
describes how “الفاظ” (alfaaz, words) is inflected from the word “لفظ” (lafz, word). The
definition of morphology leads to the concept of the morpheme, the smallest unit of meaning
or smallest recurring unit. The relation of morphology to morphemes is the same as that of
syntax to words. Morphemes express concepts like “بادل” (badal, cloud) and “پنکھا”
(pankha, fan), or relationships like “مند” (mand) in “دولت مند” (dolat mand, rich) and “بے”
(bay) in “بے جان” (bayjaan, lifeless). Morphemes can also express syntactic features, for
example number (singular, plural), e.g., “پودا” (poda, plant), “پودے” (poday, plants), and
gender (male, female), e.g., “گیا” (gya, went, inflected for masculine), “گئی” (gayee, went,
inflected for feminine).
The term morph represents morphemes as parts of a word, e.g., in the word “پر” (pur,
feather) the morpheme “پر” (pur) is realized as the morph “پر” (pur) to form the word
“پر” (pur, feather). In “پروں” (puron, feathers), the morpheme “پر” (pur) and the
PLURAL morpheme are realized as “پر” + “وں” (pur+oon) respectively to form the word
“پروں” (puron, feathers).
The term allomorph represents the different forms of a morpheme, e.g., the PLURAL
morpheme in Urdu has several allomorphs. The plural of “پودا” (poda, plant) is “پودے” (poday,
plants), while the plural of “پھول” (phool, flower) is “پھولوں” (phoolon, flowers). The
morphemes are further categorized as free morphemes (which can form words by themselves),
e.g., “بارش” (barish, rain) and “آسمان” (aasman, sky), and bound morphemes (which must be
combined with other words to form words), e.g., “با” (ba) in “باعزت” (baizat, respectable).
Words can consist of free morphemes only, bound morphemes only, or free and bound morphemes
jointly.
As far as Urdu morphology is concerned, it lies in the category of morphologically rich
languages (MRLs) like Arabic, Persian, Chinese, Turkish, Finnish, and Korean. The
MRLs pose considerable challenges for natural language processing, machine
translation and speech processing (Abdul-Mageed and Korayem, 2010). These languages
are distinctive due to highly productive and frequent morphological processes at the word
level, e.g., compounding, reduplication, inflection, agglutination and derivation. Due
to these morphological operations, the same root word can generate multiple word forms.
This makes the stemming process quite challenging.
Also, the lexicons of MRLs tend to be more complex. The dependencies and
relationships between different parts of speech are frequent. This increases the level of
intricacy and results in inflection or derivation gaps, because various forms of the
same underlying base form can easily be misidentified as unrelated entries, with negative
effects on the overall alignment of words and hence on the processing accuracy.
Some frequent morphological processes for Urdu are discussed below:
Inflectional operations deal with the variety of forms of the same word. The changes
indicate grammatical features, e.g., “جانا” (jana, to go) from “جا” (ja, go). The difficult
aspect of these inflections is their diversity. For example, to make a plural in English,
s, es or ies is used according to predefined grammatical rules. Exceptions exist
but are rare.
In contrast, in the Urdu language, the Arabic loan words are made plural according to
Arabic grammar, whereas the Persian loan words follow Persian grammar, and so on.
For example, the plural of “لفظ” (lafz, word) is “الفاظ” (alfaaz, words) and the plural of
“پودا” (poda, plant) is “پودے” (poday, plants). The two are inflected differently to form the plural.
Derivational operations deal with the production of new words with different meanings.
The new words are produced by adding affixes. Often the produced words have a
changed part of speech, e.g., “خوش” (khush, happy) and “خوش بخت” (khushbakht, lucky).
3.4.2. Compounding
The compounding process results in new words, which are made by combining
two already existing words M and N. Some examples of compound words in Urdu are:
MN formation: M and N are independent in meaning and syntax, but they are
written together to make a new word.
For example, M = “موم” (mom, wax), N = “بتی” (bati, light), make the word
MN = “موم بتی” (mombati, candle).
M-O-N formation: M and N are independent words, but are related in meaning or
context. Their syntax remains the same, with an additional letter “و” (O) between them.
This letter “و” (O) means “and”.
For example, M = “ملک” (mulk, country), N = “ملت” (milat, nation), make the
compound word M-O-N = “ملک و ملت” (mulk-o-milat, country and nation).
3.4.3. Reduplication
Both full and partial reduplication of words are very common in Urdu. For example, the
full reduplication of the word “کبھی” (kabhi, sometime) results in “کبھی کبھی” (kabhi
kabhi, infrequently).
So far, we have given an overview of the Urdu language. Here, we concisely
describe the challenges posed by its distinctive features. These challenges relate to
tasks within sentiment analysis, such as corpus collection, lexicon construction, and word
boundary identification:
worship) are Arabic, Persian, English, and Sanskrit loan words, respectively. Due to this
variability, the morphological operations use varying grammar rules. Most of the loan
words follow the grammar rules of their parent language. Generally, the Sanskrit-based
adjectives inflect to agree with the noun they qualify; this property is called
marking with respect to case, gender, or number. For instance, the demonstrative adjective
“جیسا” (jaisa, such as) becomes “جیسی” (jaisee, such as) and “جیسے” (jaisay, such as) for
gender and number, respectively. On the other hand, most of the Persian loan words, like
“تازہ” (tazah, fresh), remain unmarked because they follow Persian grammar.
Technically, these features result in much more intricate lexicons for natural language
processing applications. The out-of-vocabulary rate is much higher than for
languages with well-defined grammars. They also lead to poor or unreliable language model
probability estimation, because many combinations of word forms are
missing or rarely available in the language model training data.
Table 3.5 Inflection of multiple words from root word “علم” in the Urdu language.
“جگنو” (jugnu, firefly) is a single-ligature word, but the word “جاگو” (jago, wakeup) has
two ligatures. If the ending letter of a word is a joiner, it tends to join with the first
letter of the next word, resulting in a misidentification of the word boundaries. For
example, “کل رات” (kal raat, tomorrow night) consists of two different words written
with a space, but if this space is omitted by mistake, the last joiner of the first word
will join with the first letter of the second word and the text becomes “کلرات” (kalraat).
Hence, unlike in English text, spaces are not always true indicators of word
boundaries.
case marker “ﮐﮯ” (kay). Some more examples of the use of case markers are
“اﯾﺮان ﮐﺎ ﺑﺎدﺷﺎه” (Iran ka badshah, king of Persia) and “ﺷﯿﺸﮯ ﮐﯽ ﺑﻮﺗﻞ” (sheeshay ki bottle, glass
bottle).
Moreover, Urdu text contains two types of affixes: (a) morphemes and (b) words or
lexical units. Morphemes are attached to nouns at the lexical level through morphological
operations. For example, to form the plural “ﭘﻮدے” (poday, plants) of the word “ﭘﻮدا” (poda,
plant), the plural suffix “ے” (ay) is applied, as shown in Table 3.6.
The words or lexical units, in contrast, are independent units. They are further categorized as
case markers, pure postpositions, and possession or genitive markers. The case markers
are further divided into core case markers and oblique case markers. Case markers assign a
grammatical function to the words they mark. In many languages they are morphologically attached
to those words at the lexical level, but in Urdu they are syntactically attached and
lexically independent.
As an example of a core case marker, consider the sentence “ﻣﯿﮟ ﻧﮯ ﮐﮩﺎ” (mein nay kaha, I
said), in which the case marker “ﻧﮯ” (nay) is used. Similarly, in the phrase “آپ ﮐﺎ ﻧﺎم”
(aap ka naam, your name), the possession marker “ﮐﺎ” (ka) is used. Table 3.6 gives some
more examples.
ﻧﮩﯿﮟ ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﮐﮯ ﮔﻨﺒﺪ ﭘﺮ (naheen tera nasheman kasr-e sultani kay gunbad par)
ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﮐﮯ ﮔﻨﺒﺪ ﭘﺮ ﻧﮩﯿﮟ (tera nasheman kasr-e sultani kay gunbad par naheen)
ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮔﻨﺒﺪ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﭘﺮ ﻧﮩﯿﮟ (tera nasheman gunbad-e kasr-e sultani par naheen)
Translation: Your home is not on the tower of the king’s palace.
Chapter review:
This chapter described the linguistic characteristics of the Urdu language. From
this description, we conclude that Urdu is distinctive in a number of aspects related to
its orthography, morphology, grammar and vocabulary. These distinctive linguistic features
make it a challenging domain for the sentiment analysis community. Analyzing the sentiment
orientation of Urdu text therefore requires adapted or altogether different algorithms and
approaches.
Chapter 4| SentiUnits: The Appraisal Expressions 47
CHAPTER 4
In an opinionated sentence, not all terms are subjective. Indeed, the sentiment of a
sentence depends only on some specific words or phrases. Consider the examples “This
book is very good.” and “The movie is boring.” The underlined words are the expressions,
made of one or more words, which carry the sentiment information of the whole sentence.
We label them SentiUnits and treat these units alone as the representatives of the
whole sentence’s sentiment. These are in fact the appraisal expressions defined and
discussed in Chapter 2. SentiUnits can be defined as the core grammatical structures
that express the opinion, i.e., the sentiment carrier expressions in a sentence (Syed et al.
2010). To understand the structure of SentiUnits, consider the following examples
from Urdu text in Table 4.1.
No. | Urdu | Transliteration | Translation
1 | ﯾہ اﯾﮏ ﻋﻤﺪه ﮐﺘﺎب ﮨﮯ | Yeh aik umdah kitab hay. | This is a fine book.
2 | ﯾہ ﻋﻤﺪه اور ﻣﻌﻠﻮﻣﺎﺗﯽ ﮐﺘﺎب ﮨﮯ | Yeh umdah aur malumati kitab hay. | This is a fine and informative book.
3 | ﯾہ ﺳﺐ ﺳﮯ ﻋﻤﺪه ﮐﺘﺎب ﮨﮯ | Yeh sab se umdah kitab hay. | This is the finest book.
4 | ﯾہ ﮐﺘﺎب اﺗﻨﯽ ﺑﺮی ﻧﮩﯿﮟ | Yeh kitab itni buri naheen. | This book is not very bad.
Table 4.1. Examples of opinionated sentences from Urdu with different SentiUnits.
In Table 4.1, the underlined expressions are responsible for the subjectivity orientation. All
other words are neutral and have no effect on the classification. On a closer look at these
examples, we can observe that the SentiUnits are built around adjectives (as head words).
They can consist of a single adjective, as in sentence 1, or of multiple words, as in
sentences 2, 3 and 4. Sentences 1, 2 and 3 have adjectives with positive orientation,
whereas sentence 4 contains a negative word that becomes positive through the use of
negation. In this case, the negation acts as a polarity shifter. Moreover, the intensity
4.1. Adjectives
An adjective is a fundamental part of speech (POS) that expresses an attribute of a noun
(place, thing, or person). Generally, adjectives appear in the sentence structure in two
ways: either they are directly linked with the noun within the noun phrase, or they are
associated with the noun through some other part of speech, e.g., a verb. In both cases they
describe the characteristic features of the noun they qualify. This suggests that any
opinion, sentiment or judgment about a noun can be determined by analyzing its
adjectives. Owing to this characteristic, the first efforts at automatic sentiment analysis
(SA) of English text employed adjectives as the main feature of the given text
morphological structure of the adjectives and their inflected forms. We take the most
commonly used adjectives as examples and describe their modifications.
a) Adjective marking: agreement in gender and number: Adjectives are marked
through suffixes for gender, masculine (m) and feminine (f), and for number, singular
(s) and plural (p). For example, the masculine adjective “اﭼﮭﺎ” (acha, good) is inflected
for gender as “اﭼﮭﯽ” (achi, good) and for number as “اﭼﮭﮯ” (achay, good). These suffixes
are attached so that the adjective agrees with the noun or nouns it qualifies. There are
therefore three suffixes: singular-masculine (a), feminine (ee) and plural-masculine (ay).
The single feminine suffix (ee) is used for both singular and plural.
Some examples of marked adjectives are given in Table 4.3. In this table we
consider three nouns: (a) masculine-singular “ﺑﭽہ” (bacha, kid), (b) feminine-singular
“ﮐﺎر” (car, car) and (c) masculine-plural “دن” (din, days). These nouns cause inflection in the
respective adjectives: “اﭼﮭﺎ” (acha, good), “ﻟﻤﺒﺎ” (lamba, long), and “ﺑﺮا” (bura, bad).
Adjective (m, s) | Inflected for gender (f) | Inflected for number (m, p)
“اﭼﮭﺎ ﺑﭽہ” (acha bacha, good kid) | “اﭼﮭﯽ ﮐﺎر” (achee car, good car) | “اﭼﮭﮯ دن” (achay din, good days)
“ﻟﻤﺒﺎ ﺑﭽہ” (lamba bacha, tall kid) | “ﻟﻤﺒﯽ ﮐﺎر” (lambee car, long car) | “ﻟﻤﺒﮯ دن” (lambay din, long days)
“ﺑﺮا ﺑﭽہ” (bura bacha, bad kid) | “ﺑﺮی ﮐﺎر” (buree car, bad car) | “ﺑﺮے دن” (buray din, bad days)
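The marking rule described above can be sketched in a few lines of Python. This is only an illustration over transliterated forms, assuming that a marked adjective is recognizable by its final "a"; the function name inflect_adjective and the suffix table are our own illustrative choices, not part of the framework.

```python
# Suffixes for marked adjectives, as listed in the text (transliteration).
SUFFIXES = {
    ("m", "s"): "a",   # singular-masculine
    ("m", "p"): "ay",  # plural-masculine
    ("f", "s"): "ee",  # feminine, singular
    ("f", "p"): "ee",  # feminine, plural (same suffix as singular)
}


def inflect_adjective(adjective: str, gender: str, number: str) -> str:
    """Return the form of an adjective that agrees with (gender, number).

    Simplifying assumption: an adjective is treated as marked only if its
    transliteration ends in "a"; everything else (e.g. the Persian loan
    word "tazah") is returned unchanged.
    """
    if not adjective.endswith("a"):
        return adjective
    stem = adjective[:-1]
    return stem + SUFFIXES[(gender, number)]


print(inflect_adjective("lamba", "f", "s"))  # lambee
print(inflect_adjective("bura", "m", "p"))   # buray
```

A real implementation would need a lexicon flag for markedness, since some unmarked adjectives may also end in "a" in transliteration.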
b) Agreement in case: Urdu nouns have three cases: nominative, oblique and vocative.
Adjectives that qualify an oblique noun also take the oblique form.
The masculine-singular suffixes (a) and (an) are replaced by (ay) and (ayn), respectively.
The feminine adjectives remain the same, as shown in Table 4.4.
Case | Masculine | Feminine
Nominative | “ﭼﮭﻮﭨﺎ” (chota, little) | “ﭼﮭﻮﭨﯽ” (chotee, little)
Nominative | “ﺳﺎﺗﻮاں” (satwan, seventh) | “ﺳﺎﺗﻮﯾﮟ” (satween, seventh)
Oblique | “ﭼﮭﻮﭨﮯ” (chotay, little) | “ﭼﮭﻮﭨﯽ” (chotee, little)
Oblique | “ﺳﺎﺗﻮﯾﮟ” (satwayn, seventh) | “ﺳﺎﺗﻮﯾﮟ” (satween, seventh)
Vocative | “ﭼﮭﻮﭨﮯ” (chotay, little) | “ﭼﮭﻮﭨﯽ” (chotee, little)
Vocative | “ﺳﺎﺗﻮﯾﮟ” (satwayn, seventh) | “ﺳﺎﺗﻮﯾﮟ” (satween, seventh)
c) Adjectives with noun sequences: Sometimes an adjective appears in a sentence with a
sequence of two or more nouns. In this case the nouns may differ in
gender and number.
The adjective agrees with the noun nearest to it. Examples are given in Table
4.5, in which “ﺑﮍا” (bara, big) inflects for “ﭘﻠﻨﮓ” (palang, bed) and “ﭼﮭﻮﭨﯽ” (choti,
younger) inflects for “ﺧﺎﻟہ” (khala, aunt).
Descriptive Adjectives: These are the most frequent and important type of adjectives.
They describe attributes of the noun they qualify in terms of its size, dimensions, sound,
color, shade, shape, quality, personal trait, or time, etc.
Some examples of descriptive adjectives in Urdu are given in Table 4.7, where “ﭼﮭﻮﭨﺎ”
(chota, little) and “ﻟﻤﺒﺎ” (lamba, long) describe the size of a noun, and “ﭘﯿﻼ” (peela,
yellow) and “ﺳﺮخ” (surkh, red) express the color.
Category | Examples
Size | “ﭼﮭﻮﭨﺎ” (chota, little), “ﻟﻤﺒﺎ” (lamba, long)
Color | “ﭘﯿﻼ” (peela, yellow), “ﺳﺮخ” (surkh, red)
Shape | “ﻣﺮﺑﻊ” (muraba, square), “ﺗﮑﻮﻧﺎ” (tikona, triangular)
Personal trait | “اداس” (udaas, sad), “ﻣﺠﺒﻮر” (majboor, helpless)
Qualities | “ﻣﮩﺮﺑﺎن” (mehrbaan, kind), “اﭼﮭﺎ” (acha, good)
Predicative Adjectives: When the adjectives are used predicatively, they bring in new
information about the noun instead of modifying it.
yellow) identify the color of the noun “ﻏﺒﺎره” (ghubara, balloon). Only a specific feature
of the noun is described; both parts of speech, i.e., adjective and noun, retain their
individual roles. Some more examples are given in Table 4.9.
Possessive Adjective: Possessive adjectives indicate possession. The
possession relation is realized in two ways: either the adjectives precede the head noun as
modifiers in noun phrases, like the attributive adjectives, or they are preceded by a
suitable form of the genitive postposition “ﮐﺎ” (ka, of), “ﮐﯽ” (kee, of), or “ﮐﮯ” (kay, of).
These genitive postpositions are lexically independent, like “of” in English, but they agree
in number and gender with the object noun. Consider the first example from Table 4.10,
“ارﺗﻀﯽ ﮐﺎ ﭘﯿﻼ ﻏﺒﺎره” (Irtaza ka peela ghubara, Irtaza’s yellow balloon). In this example the
genitive postposition “ﮐﺎ” (ka, of) is used with a singular masculine noun phrase, i.e., “ﭘﯿﻼ ﻏﺒﺎره”
(peela ghubara, yellow balloon). In the second example, “ﻣﯿﺮی” (meri, my) is a
possessive adjective for the first person, inflected here for
gender. The third example also contains the genitive postposition “ﮐﺎ” (ka, of) with a singular
masculine noun.
Examples
“ارﺗﻀﯽ ﮐﺎ ﭘﯿﻼ ﻏﺒﺎره” (Irtaza ka peela ghubara, Irtaza’s yellow balloon)
“ﻣﯿﺮی اداس ﭼﮍﯾﺎ” (meri udaas chiria, my sad sparrow)
“اﯾﺮان ﮐﺎ ﻣﮩﺮﺑﺎن ﺑﺎدﺷﺎه” (Iran ka mehrbaan badshah, kind king of Persia)
Adjectives | Examples
“اﯾﺴﺎ” (aisa, like this) | “اﯾﺴﺎ ﻟﺒﺎس” (aisa libas, the dress like this)
“وﯾﺴﺎ” (waisa, like that) | “وﯾﺴﺎ ﻟﺒﺎس” (waisa libas, the dress like that)
“ﺟﯿﺴﺎ” (jaisa, such as) | “ﺟﯿﺴﺎ ﻟﺒﺎس” (jaisa libas, such dress)
“ﮐﯿﺴﺎ” (kaisa, how) | “ﮐﯿﺴﺎ ﻟﺒﺎس؟” (kaisa libas, what kind of dress)
As shown in Table 4.11, the Urdu demonstrative adjectives differ for near “اﯾﺴﺎ”
(aisa, like this), far “وﯾﺴﺎ” (waisa, like that), relative “ﺟﯿﺴﺎ” (jaisa, such as) and
interrogative “ﮐﯿﺴﺎ” (kaisa, how) demonstrations. These demonstrative adjectives inflect
to agree with the noun in gender and number. The inflections are shown in Table 4.12.
Reflexive possessive adjective: The reflexive possessive adjectives are used very frequently
and agree with the noun they qualify, i.e., they inflect for gender, number and
case. For example, “اﭘﻨﺎ” (apna, own), “اﺳﮑﺎ” (uska, someone else’s) and “اﺳﮑﺎ” (iska,
someone else’s) indicate one’s own, someone else’s (far), and someone else’s
(near), respectively. Examples of the reflexive possessive adjective “اﭘﻨﺎ” (apna, own) are given in
Table 4.13; it is inflected for gender as “اﭘﻨﯽ ﭼﺎﺑﯽ” (apni chabee, one’s own key) and for
number as “اﭘﻨﮯ ﻟﻮگ” (apnay loag, one’s own people).
4.2. Modifiers
The modifiers intensify the orientation of an adjective. These can be absolute,
comparative or superlative. Modifiers formed with postpositions are very frequent in
Urdu writing. For example, the absolute adjective “ﻣﮩﻨﮕﺎ” (mehnga, expensive) is
modified by the postposition “ﺳﮯ” to make it comparative: “اس ﺳﮯ ﻣﮩﻨﮕﺎ” (is say mehnga,
more expensive). Likewise, the postpositional phrase “ﺳﺐ ﺳﮯ” results in a superlative expression;
“ﺳﺐ ﺳﮯ ﻣﮩﻨﮕﺎ” (sab say mehnga, most expensive). Some Persian loan words are also
commonly used in inflected forms. For example, “ﮐﻢ” (kam, less) is absolute and is
inflected to form the comparative “ﮐﻤﺘﺮ” (kamtar, lesser) and the superlative “ﮐﻤﺘﺮﯾﻦ”
(kamtareen, least). Detailed examples of modifiers are given in Table 4.14.
These examples are further elaborated below for the noun “ﻟﺒﺎس” (libaas, dress):
a). Absolute
“-”ﯾہ ﻟﺒﺎس ﻣﮩﻨﮕﺎ ﮨﮯ
Yeh libaas mehnga hay.
This dress is expensive.
b). Comparative
Either “say” or “say zyadah” can be used for a comparison between
two objects.
“-”ﯾہ ﻟﺒﺎس اس ﺳﮯ ﻣﮩﻨﮕﺎ ﮨﮯ
Yeh libaas us say mehnga hay
This dress is more expensive than that.
or
“-”ﯾہ ﻟﺒﺎس اس ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ ﮨﮯ
Yeh libaas us say zyadah mehnga hay
This dress is more expensive than that.
c). Superlative
For superlatives, “sab say”, “sab main”, or “sab say zyadah” is used.
“-”ﯾہ ﻟﺒﺎس ﺳﺐ ﺳﮯ زﯾﺎده ﻣﮩﻨﮕﺎ ﮨﮯ
Yeh libaas sab say zyadah mehnga hay
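The three degrees illustrated above can be told apart from the words that precede the adjective. The following sketch works on transliterated tokens in reading order and covers only the postposition-based patterns named in the text ("say", "say zyadah", "sab say", "sab main", "sab say zyadah"); the function name adjective_degree is hypothetical, and inflected Persian forms such as "kamtar" are not handled.

```python
def adjective_degree(tokens, adjective_index):
    """Classify the adjective at adjective_index as absolute, comparative
    or superlative, based on the words immediately before it."""
    before = list(tokens[:adjective_index])
    # An optional "zyadah" (more) may stand directly before the adjective.
    if before and before[-1] == "zyadah":
        before.pop()
    # "sab say" / "sab main" mark the superlative.
    if len(before) >= 2 and before[-2] == "sab" and before[-1] in ("say", "main"):
        return "superlative"
    # A bare "say" after the compared object marks the comparative.
    if before and before[-1] == "say":
        return "comparative"
    return "absolute"


sentence = "yeh libaas sab say zyadah mehnga hay".split()
print(adjective_degree(sentence, sentence.index("mehnga")))  # superlative
```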
4.3. Orientation
Orientation describes the positivity or negativity of an expression, e.g., “اﭼﮭﺎ” (acha, good)
has a positive orientation.
4.4. Intensity
This is the intensity of orientation, e.g., “ﺑﮩﺘﺮ” (behtar, better) versus “ﺑﮩﺘﺮﯾﻦ” (behtareen, best).
4.5. Polarity
A polarity mark is attached to each word in the lexicon to show the orientation.
4.6. Negations
Negation is one of the most frequent linguistic structures that change the polarity of a
word, phrase, or sentence. Negation is not limited to negation markers or particles such as
not, never, or no; there are various concepts which serve to negate the inherent
Sentential Negation:
The negative particles “ﻧﮩﯿﮟ” (naheen, not), “ﻣﺖ” (mat, don’t) and “ﻧﺎ” (na, no) are used to
express sentential negation. The particle “ﻧﮩﯿﮟ” (naheen, not) appears before the main
verb, which may or may not be followed by an auxiliary verb. In imperative
constructions, the particles “ﻣﺖ” (mat, don’t) and “ﻧﺎ” (na, no) are used in the preverbal
position. Table 4.15 shows the use of these negation particles before the main verbs “ﺟﺎﺗﺎ”
(jata, goes) and “ﭘﮍھﻮ” (parho, read).
Examples
“وه ﺳﮑﻮل ﻧﮩﯿﮟ ﺟﺎﺗﺎ ﮨﮯ” (woh school naheen jata hay, He doesn’t go to the school.)
“ﮐﺘﺎب ﻣﺖ ﭘﮍھﻮ” (kitaab mat parho, Don’t read the book.)
“ﮐﺘﺎب ﻧﺎ ﭘﮍھﻮ” (kitaab na parho, Don’t read the book.)
Constituent Negation:
The constituent negation is used to negate one or more particular constituents of a
sentence. Usually the negative particle comes after the negated constituent. Some
common constituent negation particles are “ﻧﮩﯿﮟ” (naheen, not), “ﻣﺖ” (mat, don’t), “ﻧﺎ”
(na, no), “ﻋﻼوه” (ilaawa, except), “ﺳﻮا” (siva, except) and “ﺑﻨﺎ” (bina, without). In Table
4.16, the negation particles “ﻧﮩﯿﮟ” (naheen, not), “ﻣﺖ” (mat, don’t), “ﻧﺎ” (na, no) and “ﺳﻮا”
(siva, except) are used after the negated constituent.
Examples
“ﮐﯿﻤﺮه ﮐﺎﻻ ﻧﮩﯿﮟ ﻧﯿﻼ ﮨﮯ” (camera kala naheen neela hay,
camera is blue, not black)
“اﻧﮕﻮر ﻣﺖ/ﻧﺎ ﺧﺮﯾﺪو اﻧﺎر ﺧﺮﯾﺪو” (angoor mat/na khareedo anar khareedo,
don’t buy grapes, buy pomegranate)
“ﻣﻮﺑﺎﯾﻞ ﮐﮯ رﻧﮓ ﮐﮯ ﺳﻮا ﺳﺐ اﭼﮭﺎ ﮨﮯ” (mobile kay rang kay siwa sab acha hay,
everything is fine with the mobile except its color)
In coordinate structures, the negation particle does not move to the coordinate point
unless the identical element is deleted from the second negative conjunct. In a
‘neither … nor’ construction, however, it appears in the initial position. For example,
“ﻧﺎ ﮔﮭﺮ ﻧﯿﺎ ﮨﮯ ﻧﺎ ﮨﻮادار” (na ghar nya hay, na hawa daar, The house is neither new nor
ventilated).
Hence, as in English, negation particles in Urdu exist at both the sentential and the
constituent level, but their use in the sentence structure is different.
The model is grammatically motivated and works at the level of the grammatical structure of
the sentences. It uses a sentiment-annotated, lexicon-based approach to identify
such expressions in corpora of Urdu text reviews (see Figure 4.2). The
adjectives, their modifiers and polarity shifters such as the explicit negation particles “ﻧﮩﯿﮟ”
(naheen, not), “ﻣﺖ” (mat, don’t), “ﻧﺎ” (na, no), etc., are handled within these expressions.
For a given Urdu review, SentiUnit extraction and polarity
computation take place in three phases.
a. First, the normalized text is passed to the part-of-speech (POS) tagger, which
assigns POS tags to all the terms. Along with this tagging, the subjective words are
annotated with their polarities, with the help of the sentiment-annotated lexicon
of Urdu text.
b. The annotated subjective terms (adjectives) serve as the head words for the
next phase, in which shallow parsing is applied for phrase chunking and the adjectival
phrases are chunked out. These chunks are then converted into SentiUnits by
attaching the negations, modifiers, conjunctions, etc.
c. In the last phase, the identified SentiUnits are analyzed for polarity computation. The
polarity of the subjective terms is combined with the effect of the negation, if
one exists in the SentiUnit. Hence, the overall sentiment or impact of a SentiUnit is a
combination of its constituents.
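As an illustration of phase (c), the sketch below combines the polarity of a head word with its modifiers and an optional negation. The numeric polarity values, the modifier weights and the rule that negation simply flips the sign are assumptions made for this example only; they are not the values or the exact combination rule of the lexicon described in this work.

```python
from dataclasses import dataclass, field


@dataclass
class SentiUnit:
    head: str                                      # adjective head word (transliterated)
    modifiers: list = field(default_factory=list)  # attached intensifiers
    negated: bool = False                          # True if a negation particle is attached


# Hypothetical lexicon entries, for illustration only.
HEAD_POLARITY = {"umdah": 1.0, "acha": 1.0, "buri": -1.0}
MODIFIER_WEIGHT = {"bohat": 1.5, "itni": 1.0, "sab say": 2.0}


def unit_polarity(unit: SentiUnit) -> float:
    """Combine head polarity, modifier weights and negation into one score."""
    score = HEAD_POLARITY.get(unit.head, 0.0)   # unknown heads count as neutral
    for modifier in unit.modifiers:
        score *= MODIFIER_WEIGHT.get(modifier, 1.0)
    if unit.negated:
        score = -score                          # negation acts as a polarity shifter
    return score


# Sentence 4 of Table 4.1: "itni buri naheen" (not very bad) comes out positive.
print(unit_polarity(SentiUnit("buri", ["itni"], negated=True)))  # 1.0
```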
In both cases they describe the characteristic features of the noun they qualify. The
following section describes the characteristics and structure of the noun phrases in the
Urdu language.
Figure 4.3 Cases of noun phrases with core case markers:
a. Nominative: There is no case marker with the NP; the noun is in the nominative case.
b. Ergative: NP marked with the case marker “ﻧﮯ” (ne) in an actor role.
c. Dative: NP marked with “ﮐو” (ko) in an indirect object or receiver role.
d. Accusative: NP marked with “ﮐو” (ko) in a direct object role.
A possession marker indicates that, in a noun phrase, the first nominal is the possessor
or holder of the second nominal.
The second nominal in the noun phrase determines the form of the possession marker:
the first nominal is in the oblique form, and the marker agrees with the second nominal
in number and gender. For example, in the noun phrase “ﻓﻠﻢ ﮐﺎ ﻧﺎم” (film ka naam, name of
the movie), the possession marker “ﮐﺎ” (ka) agrees with the second noun “ﻧﺎم”, which
is singular masculine.
As the possession markers are not restricted by a verbal predicate, they do not directly
mark a grammatical function.
Example:
The following sentence contains a complex noun phrase.
Description:
In this sentence a complex noun phrase is used, built from three nouns, “ارﺗﻀﯽ”
(Irtaza, proper noun), “ﮐﮭﻠﻮﻧﺎ” (khilona, toy) and “روﺑﻮٹ” (robot, robot), with a possession
marker “ﮐﺎ” (ka, of).
NP = ارﺗﻀﯽ ﮐﺎ ﮐﮭﻠﻮﻧﺎ روﺑﻮٹ
Irtaza ka khilona robot
The SentiUnit in the sentence is based on a single adjective with positive orientation, i.e.,
“ﺷﺎﻧﺪار” (shandar, wonderful).
Chapter review:
This chapter described SentiUnits in detail as the sentiment carrier expressions. A
general model for identifying a subjective sentence or opinion with
identifiable appraisal expressions is based on three units: the source of appraisal,
the appraisal expression, and the target of appraisal. This model is defined in detail in
the next chapter, where the source of appraisal in a given review is the reviewer and the
target is the entity about which the appraisal is made. Our approach to sentiment
analysis of the Urdu language is grammatically motivated and incorporates a sentiment-
annotated lexicon for identifying the sentiment carrier expressions, or
appraisal expressions, in a sentence.
Chapter 5| Implementation: Classification Model and Lexicon Structure 66
CHAPTER 5
Figure 5.1 System model representing modules and their interactions (Syed et al 2012).
5.8. PREPROCESSOR
In general, for natural language processing applications, the preprocessing phase deals
with the removal of punctuation marks and other unnecessary symbols and the
stripping of HTML tags. In addition to these tasks, our PREPROCESSOR module has to
handle diacritics and word boundary identification, issues which are specific to the Urdu
language.
process becomes complicated if white spaces or other word delimiters are rarely or never
used as word boundaries.
As already mentioned in Chapter 3, Urdu orthography is context sensitive. The
“ﺣﺮوف” (haroof, letters) are divided into two categories, joiners and non-joiners. The
joiners take multiple glyphs and shapes according to the context, which causes word
boundary identification issues. The work in (Durrani and Hussain 2010) divides the
word segmentation of Urdu text into two sub-problems, i.e., space insertion and space
omission.
i) Space-insertion
Many words in Urdu are made up of more than one ligature (usually two). Semantically and
syntactically, these ligatures are part of a single word. If the last letter of the first ligature
in a word is a joiner, it tends to join with the first letter of the second ligature. To
avoid this joining, the writer inserts a space.
This causes space insertion errors. For example, “ﺧﻮش ﺑﺎش” (khush bash, happy) is a single word
with two ligatures, L1 = “ﺧﻮش” and L2 = “ﺑﺎش”. The last letter of L1, “ش”, is a joiner which
tends to join with the first letter of L2, “ب”; to avoid this joining, a space is inserted while
typing the word. Omitting this space gives “ﺧﻮﺷﺒﺎش”, which is not a correct word, so
the writer cannot avoid the space.
ii) Space-omission
Many words end with non-joiner letters. As non-joiner letters keep a
constant shape, writers usually do not insert a space before the next word to
mark the word boundary. This does not affect the readability of the words, but for
computational tasks boundary identification becomes an issue, as both words are
written continuously without a space. For example, the phrase “ﺷﯿﺮاورﺑﮑﺮی” (shair aur
bakri, lion and goat) is written without spaces, and “ﺷﯿﺮ اور ﺑﮑﺮی” (shair aur bakri, lion and
goat) with spaces. We rewrite the phrase with the symbol “|” to indicate the word
boundaries: “ﺷﯿﺮ| اور| ﺑﮑﺮی” (shair aur bakri, lion and goat).
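To make the space-omission problem concrete, the sketch below recovers word boundaries from a space-free string by greedy longest matching against a word list. This is a generic dictionary-based technique shown only for illustration; it is not the segmentation model of (Durrani and Hussain 2010), and the transliterated vocabulary is a made-up toy example.

```python
def segment(text, vocabulary):
    """Greedy longest-match segmentation of a string written without spaces.

    vocabulary must be a non-empty set of known words. Characters that do
    not start any known word are emitted as single-character tokens.
    """
    max_len = max(len(word) for word in vocabulary)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocabulary:
                words.append(text[i:j])
                i = j
                break
        else:
            words.append(text[i])
            i += 1
    return words


vocabulary = {"shair", "aur", "bakri"}
print("|".join(segment("shairaurbakri", vocabulary)))  # shair|aur|bakri
```

Greedy matching can choose a wrong boundary when one vocabulary word is a prefix of another, which is one reason why practical segmenters add ranking or statistical evidence on top of the lexicon lookup.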
For the Urdu language, most researchers handle word segmentation as
part of a larger task, e.g., a morphological analyzer, POS tagger, or translator. A few
contributions deal with this issue as an independent task, for example, (Durrani and
Hussain 2010), (Lehal 2010) and (Lehal 2009). In particular, (Durrani and Hussain 2010)
presents a detailed literature survey to identify the inherent causes and then
proposes a word segmentation model.
Based on the above discussion and on previous work, we propose to
perform the PREPROCESSOR task in four steps, as shown in Figure 5.2. First,
normalization is performed on the given text to remove symbols and tags. Second,
diacritic omission is performed to avoid ambiguity. Third, the sentence is tokenized as
a sequence of orthographic words OW = ow1, ow2, …, own, where the words ow1, ow2, ...
are not necessarily grammatical or meaningful words but are only orthographically separated
from each other.
This sequence becomes the input to the final step, the segmentation module. The result of
segmentation is a sequence of meaningful and grammatically correct words, ready for
further processing.
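The first three steps can be sketched as follows. This is a minimal illustration rather than the PREPROCESSOR implementation: the function names are ours, the set of removed symbols (ASCII punctuation plus the Urdu full stop, comma and question mark) is an assumption, and the diacritics are taken to be the Arabic-script marks U+064B to U+0652 and U+0670. The fourth step, segmentation, is left out.

```python
import re
import string

# Step 1: normalization (HTML tags and punctuation are replaced by spaces).
TAG_RE = re.compile(r"<[^>]+>")
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "\u06D4\u060C\u061F]")

# Step 2: diacritic omission (fathatan .. sukun, plus superscript alef).
DIACRITICS_RE = re.compile("[\u064B-\u0652\u0670]")


def normalize(text: str) -> str:
    text = TAG_RE.sub(" ", text)
    return PUNCT_RE.sub(" ", text)


def strip_diacritics(text: str) -> str:
    return DIACRITICS_RE.sub("", text)


def tokenize(text: str) -> list:
    # Step 3: orthographic words are simply the space-separated units.
    return text.split()


def preprocess(text: str) -> list:
    """Return the sequence of orthographic words ow1, ow2, ..., own."""
    return tokenize(strip_diacritics(normalize(text)))


# "<p>kitaab achi hay.</p>" in Urdu script, with a kasra on the first letter.
raw = "<p>\u06a9\u0650\u062a\u0627\u0628 \u0627\u0686\u06be\u06cc \u06c1\u06d2\u06d4</p>"
print(preprocess(raw))
```

The output of preprocess would then be passed to a segmentation module that repairs space-insertion and space-omission errors.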
Figure 5.2 Preprocessing of the input sentence by the PREPROCESSOR module (Syed et al 2012).
5.9. EXTRACTOR
The EXTRACTOR module identifies and extracts the SentiUnits and the targets. Two
subtasks are performed:
Extracting SentiUnits with adjectives as head words
Extracting targets with nouns as head words
The EXTRACTOR module uses text chunking based on shallow parsing. This method identifies
the beginnings and ends of grammatical phrases without parsing the full phrase structure.
Hence, the EXTRACTOR shallow-parses each sentence in the given review to find
adjective or noun phrases and then works out their attributes (modifiers, orientation,
intensity, etc.), modeling the behavior of the modifiers and the negations within the
phrase.
Figure 5.3 Processing of the input sentence by EXTRACTOR module (Syed et al 2012).
For extracting SentiUnits, the parser starts with a lexicon of nominal and adjectival head
words, which defines initial values for orientation, whether positive or negative. In addition
to a positive or negative orientation, head words exhibit an intensity of orientation. The parser
searches for occurrences of these head words in the sentence, and upon finding them it
moves rightward to attach modifiers, because in Urdu, which is written from right to left,
the modifiers appear on the right side of the adjectives. It then searches for polarity
shifters or negations and finally identifies the whole subjective expression. Likewise, the
parser identifies candidate targets with the help of the lexicon: it finds all target groups
that match words specified in the lexicon. These steps are given in Figure 5.3.
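The head-word-driven search can be sketched as follows for transliterated tokens given in reading order. Because Urdu is written from right to left, the modifiers that appear to the right of an adjective are the tokens that precede it in the list. The word sets are hypothetical stand-ins for lexicon entries, and only a negation particle directly after the head is handled; conjunctions and targets are left out.

```python
# Hypothetical lexicon fragments, for illustration only.
HEAD_WORDS = {"umdah", "buri", "achi", "malumati"}
MODIFIERS = {"bohat", "itni", "zyadah"}
NEGATIONS = {"naheen", "mat", "na"}


def extract_senti_units(tokens):
    """Return one candidate SentiUnit per adjectival head word found."""
    units = []
    for i, token in enumerate(tokens):
        if token not in HEAD_WORDS:
            continue
        # Attach modifiers that precede the head in reading order
        # (visually on its right in right-to-left script).
        start = i
        while start > 0 and tokens[start - 1] in MODIFIERS:
            start -= 1
        # Attach a negation particle that directly follows the head.
        end = i + 1
        negated = end < len(tokens) and tokens[end] in NEGATIONS
        if negated:
            end += 1
        units.append({"head": token, "span": tokens[start:end], "negated": negated})
    return units


print(extract_senti_units("yeh kitab itni buri naheen".split()))
```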
5.10. ASSOCIATOR
Figure 5.4 The dependency parsing of the given sentence (Syed et al 2012).
The extracted SentiUnits and targets are associated with each other by the
ASSOCIATOR module. We apply dependency parsing for this purpose. Figure 5.4 shows the
dependency parse of the sentence:
“”ﻟﮍﮐﺎ ﮐﻤﭙﯿﻮﭨﺮ اور اﻟﯿﮑﭩﺮوﻧﮑﺲ ﮐﯽ ﭼﯿﺰﯾﮟ ﺑﯿﭽﺘﺎ ﮨﮯ
larka computer aur electronics kee cheezain baichta hay.
The boy sells computer and electronic products.
First, the nominal group that is the lexical representation of the target is identified, and
then the values of the attributes describing that target are computed. The ASSOCIATOR finds
the target phrase by following paths through a dependency parse of the sentence. The
result of the dependency parse is a ranked list of paths or linkage specifications; the
ranking specifies the order in which the links should be traversed. For
each SentiUnit, the system looks for paths through the dependency tree which
connect any word in the SentiUnit to the next or final expected word according to the
specification of that particular link. Once a word is identified in the proper
syntactic place, shallow parsing is applied, moving rightward, to find a noun phrase
that ends in the identified word. These steps are shown in Figure 5.5.
Figure 5.5 Linking SentiUnits with candidate targets by ASSOCIATOR module (Syed et al 2012).
5.3.2. Algorithm
Hence, the steps performed by ASSOCIATOR are:
Input: Shallow parsed sentence with extracted SentiUnits and targets.
Processing: Apply dependency parse and then,
1. Search all the linkages such that;
a. The linkage is in the linkage specifications
b. The linkage connects to a chunked SentiUnit
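The linkage search in step 1 can be illustrated with a minimal sketch. The Link record, the relation labels in LINK_SPECS and both function names are assumptions made for illustration only; the actual linkage specifications of the ASSOCIATOR are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    label: str   # dependency relation name (hypothetical labels)
    source: int  # token index where the link starts
    target: int  # token index where the link ends


# Hypothetical ranked linkage specifications: earlier entries are traversed first.
LINK_SPECS = ["nsubj", "amod", "conj"]


def candidate_links(links, sentiunit_span):
    """Keep links that are in LINK_SPECS and touch the SentiUnit, ordered by rank."""
    lo, hi = sentiunit_span
    kept = [
        link for link in links
        if link.label in LINK_SPECS
        and (lo <= link.source <= hi or lo <= link.target <= hi)
    ]
    return sorted(kept, key=lambda link: LINK_SPECS.index(link.label))


def associate(links, sentiunit_span):
    """Return the token index at the far end of the best-ranked link, or None."""
    lo, hi = sentiunit_span
    for link in candidate_links(links, sentiunit_span):
        return link.target if lo <= link.source <= hi else link.source
    return None
```

The returned index is only the anchor word of the candidate target; finding the full noun phrase around it would still require the shallow parsing step described earlier.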
5.11. CLASSIFIER
The CLASSIFIER starts by calculating the intensity of orientation of the SentiUnits,
comparing each tagged word with the polarity values assigned in the lexicon entries. For
example, the expression “ﺑﮩﺖ اﭼﮭﯽ ﮐﺘﺎب” (bohat achi kitab, very good book) is more
intense than “اﭼﮭﯽ ﮐﺘﺎب” (achi kitab, good book) due to the modifier “ﺑﮩﺖ” (bohat, very),
although both are positive expressions. In this expression, the SentiUnit “ﺑﮩﺖ اﭼﮭﯽ” (bohat
achi, very good) is associated with the target “ﮐﺘﺎب” (kitab, book).
The CLASSIFIER then looks for other associations identified by the ASSOCIATOR and
calculates the polarity value of each association for a particular target, e.g., “ﮐﺘﺎب” (kitab,
book) in this case. If “ﺑﮩﺖ اﭼﮭﯽ” (bohat achi, very good) is the only expression in the
sentence that expresses sentiment about the target, then the sentence polarity equals the
polarity of this expression; otherwise, the other expressions are evaluated as well. The
sentence polarity is the sum of the polarities of these expressions, where positive
expressions contribute positive values and negative expressions contribute negative values.
5.4.2. Algorithm:
Hence, the CLASSIFIER module is divided into two steps as given next,
Step1: Compute sentence polarity
Input: Dependency parsed sentence with SentiUnits to targets associations.
Processing: Start with any one SentiUnit of a particular target
a. COMPARE each word in the SentiUnit with the lexicon to find its orientation and
polarity value;
b. COMPUTE SentiUnit polarity by adding polarities of the words according to the
intensity values
c. LOOK FOR another SentiUnit for the same target
d. Sentence polarity = SUMMATION of all SentiUnits’ polarities for a particular
target
Step2: Compute total polarity of review
a. REPEAT step 1 for all sentences
b. ADD all polarity values to calculate PR.
c. COMPARE with threshold
Case a: If PR > threshold, then R is positive.
Case b: If PR < threshold, then R is negative.
Output: Classification of review as positive or negative.
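The two steps above can be sketched as follows. The function names, the default threshold of 0 and the "undecided" label are assumptions introduced for illustration; the algorithm above does not state a threshold value or the outcome when PR equals the threshold.

```python
def sentence_polarity(sentiunit_polarities):
    """Step 1: sum the polarities of all SentiUnits of a target in one sentence."""
    return sum(sentiunit_polarities)


def classify_review(sentences, threshold=0):
    """Step 2: add all sentence polarities into PR and compare PR with the threshold.

    `sentences` is a list of lists; each inner list holds the SentiUnit
    polarities already computed for one sentence.
    """
    pr = sum(sentence_polarity(s) for s in sentences)
    if pr > threshold:
        return "positive", pr
    if pr < threshold:
        return "negative", pr
    # PR == threshold is not covered by the algorithm above (assumed label).
    return "undecided", pr
```

As a usage example, a review with sentence-level SentiUnit polarities [[2], [-1], [-1, -3]] gives PR = -3 and is therefore classified as negative under the assumed threshold.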
For the purpose of sentiment classification, the classifier is integrated with the lexicon of
annotated words (discussed in Section 5.6). In such a lexicon, a polarity mark is
annotated with each lexical entry to show its orientation and intensity. This is called the
prior polarity of the subjective words and phrases. The overall orientation of a sentence
is calculated by recognizing the prior polarities of the constituent subjective terms. This
idea works well in some simple sentences, particularly, if the polarity shifters are not
present. The polarity shifters are the words and phrases, which can change the prior
polarities of the words in a sentence.
Example:
Consider the sentence:
“”ﻣﯿﺮا ﮐﯿﻤﺮا ﮐﻢ ﻗﯿﻤﺖ ﮨﮯ ﻟﯿﮑﻦ اﺳﮑﯽ ﺑﯿﭩﺮی دﯾﺮﭘﺎ ﻧﮩﯿﮟ
mera camera kam-qeemat hay laykin iski battery derpa naheen
My camera is inexpensive but its battery is not long lasting
Description:
In this sentence, the word “دﯾﺮﭘﺎ” (derpa, long lasting) has positive prior polarity, but due
to the polarity shifter “ﻧﮩﯿﮟ” (naheen, not), its overall contribution to the sentence’s
sentiment becomes negative. Another polarity shifter in the above expression is the word
“ﻟﯿﮑﻦ” (laykin, but), which alters the positive prior polarity of the word “ﮐﻢ ﻗﯿﻤﺖ”
(kam-qeemat, inexpensive). The overall polarity of the appraisal expression is called the
SentiUnit polarity. Therefore, our approach to sentiment classification rests on two types
of polarity scores:
Prior polarity: Polarity marks annotated with the lexicon entries.
SentiUnit polarity: The overall polarity of the appraisal expression, on which the final
polarity of the sentence depends.
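The difference between the two scores can be illustrated with a small sketch. It assumes, in line with the worked examples in Chapter 6, that a negation multiplies the prior polarity by -1 and that a modifier multiplies it by its intensity; the function name and parameters are hypothetical, and shifters such as “ﻟﯿﮑﻦ” (laykin, but) are not modelled.

```python
def sentiunit_polarity(prior_polarity, negated=False, intensity=1):
    """Combine a head word's prior polarity with a modifier intensity and a negation."""
    polarity = prior_polarity * intensity
    # A negation particle flips the sign of the expression (assumed rule).
    return -polarity if negated else polarity


# "derpa" (long lasting) has prior polarity +1; with "naheen" (not) it becomes negative.
derpa_naheen = sentiunit_polarity(+1, negated=True)
# "kam-qeemat" (inexpensive) keeps its positive prior polarity when taken alone.
kam_qeemat = sentiunit_polarity(+1)
```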
At the highest level, our lexicon model categorizes all lexical entries into objective
terms and subjective terms. Objective terms have no orientation or intensity and hence
are not marked with prior polarity scores; they therefore have no effect on the overall
classification decision. Subjective terms, on the contrary, are the carriers of sentiment
and are marked with polarity scores. Their occurrence can affect or even altogether alter
the final classification decision.
The algorithm identifies the subjective words according to the prior polarities annotated
in the lexicon. Then, it attaches the polarity shifters, conjunctions, postpositions and
modifiers to extract the appraisal expressions in the opinionated sentences. These
expressions are labeled as SentiUnits. Shallow-parsing-based chunking is applied for the
extraction of the SentiUnits, with adjectives as the head words. The overall polarity of a
sentence in a given review can be determined by computing the polarity of these
expressions. Let us denote the term’s prior polarity by Tp, the SentiUnit’s polarity by
SUp, the sentence polarity by Sp, and the overall review polarity by Rp, as shown in
Figures 5.6 and 5.7.
Figure 5.7 shows the overall process of the review polarity calculation.
Figure 5.7 Computation of the overall polarity of the Urdu text based review (Syed et al 2011 a)
When the system is given a review for classification, it sets the review polarity Rp and
the sentence count SCount to zero. Then, it takes the sentences one by one. The analysis
begins with text normalization, which results in word segmentation. These words are
passed to the SentiUnit extraction and polarity computation module, which produces
polarity-annotated SentiUnits. Next, the sentence polarity Sp is computed from the
polarities of its constituent SentiUnits. The total Rp is the sum of all known sentence
polarities Sp. Finally, Rp is compared with the threshold value: if Rp is greater than the
threshold, the review is positive; if it is less, the review is negative.
Natural language processing applications use electronic, machine-readable versions of
lexicons. Processing at the lexical level requires such a lexicon, and the particular
approach adopted by the system decides whether a lexicon will be employed, as well as
the extent, nature and level of the information that is encoded in it.
Lexicons may be relatively simple, with only the words and their lexical category (part of
speech), or may be increasingly intricate and include information about the semantic
classes of the word, its arguments, the semantic limitations on these arguments,
definitions of the sense or senses in the semantic representation employed in a certain
system, and it can even hold each sense of a single word for word sense disambiguation.
A usual model of a sentiment analyzer with a sentiment-annotated lexicon incorporates
two components:
(i) The classification model, which analyzes and classifies the given opinionated text
according to inherent sentiments of the reviewer (given in previous sections), and
(ii) The lexicon or lexicons annotated with the prior polarities of the lexical entries
(words/ phrases), usually as positive or negative.
These prior-polarity-annotated lexicons are also called sentiment-annotated lexicons
(Pang and Lee, 2008). They can be compiled manually, like the General Inquirer (Stone et al.
1966), a prominent resource used in English sentiment analysis research and
applications. Alternatively, such lexicons can be generated automatically. A considerable
Definition 1:
In linguistics, a lexicon is defined as the set of all the morphemes of a particular language.
More specifically it can be a collection of terms used in a particular profession, subject,
Definition 2:
A lexeme is a conceptual unit of the morphological analysis, which corresponds to a set
of the forms taken by a single word. Generally, a lexeme belongs to a specific syntactic
class and has a definite semantic value. In case of inflecting languages (such as Arabic,
Urdu, Turkish etc), it has a related inflectional paradigm, so, a lexeme in many languages
will have many different forms.
Example:
As an example, consider the lexeme WALK from the English language; this lexeme has
different forms, i.e., walk, walks, walked and walking.
The grammar rules of a language, which include compound tense rules and subject-verb
agreement, govern the forms of the lexemes. For example, walks is the present third
person singular form of the lexeme WALK, whereas walked is its past form.
Definition 3:
The morphology (defined and discussed in Chapter 3) is also based on the notion of the
lexeme, which further describes many other terms. For example, in terms of lexemes, the
morphological operations (already defined in Chapter 3) can be stated as follows:
inflectional rules relate a lexeme to its forms, and derivational rules relate a lexeme to
another lexeme.
Definition 4:
In dictionaries, conventionally, a lexeme is presented as the lemma, which is a canonical
form of a lexeme and is used as the headword. Other forms of the lexeme that are not
common conjugations of the word are often listed later in the lexical entry.
Definition 5:
A lexical entry is a single word or a chain of words that forms a basic element of a
lexicon. Examples of single-word lexical entries are lion, computer and finger, whereas
traffic signal, life style, bits and pieces, and take care of are examples of chains of
words.
Much like a lexeme, a lexical entry generally expresses a distinct meaning but is not
limited to a single word.
i) Construction Steps
We divide the lexicon construction task into following steps:
1. Categorize the words as either subjective or objective. We have identified these two
categories of lexicon entries. When the classification algorithm is applied to these
words, the classifier simply ignores the objective terms, so its performance depends
entirely on the subjective words. For example, “ﻣﻮم” (mome, wax) is an objective word
and “ﻋﻤﺪه” (umdah, fine) is a subjective word.
2. Categorize these words according to morphological rules, which work at the word
level. This categorization helps in identifying the subjective terms in the given
text. An example is the rule for marking an adjective with the noun it qualifies.
3. Identify their grammatical rules, which describe the possible structures of a sentence
and the position of the parts of speech with respect to each other. Examples are the use
of modifiers with adjectives and the use of auxiliaries with verbs.
4. Discover relationships between different lexicon entries. These relationships can
define synonyms, antonyms, cross references, etc.
5. Decide and annotate the polarities and then the intensities of the entries. In this task,
the entries are first categorized as positive or negative, and then their intensity scores
are attached to them. Some entries have only orientations, some have only intensities
(like modifiers), and some have both values.
Intensity: This is the intensity of orientation of a lexicon entry. It describes the force of
positivity or negativity of a term. Usually, modifiers, e.g., “ﺑﮩﺖ” (bohat, more),
describe the intensity of an expression. As in other languages, there are three degrees of
intensity in Urdu: absolute (only positive or negative orientation), comparative (two
distinct entities are compared with each other) and superlative (one of all entities has the
highest orientation).
Polarity: The polarity mark is annotated to each lexicon entry to show its orientation and
intensity.
a) Absolute subjective terms with orientation only, T(O): For such terms, there are only
two possible values, i.e., “+1” for absolute positive and “-1” for absolute negative.
Examples: Absolute Urdu adjectives fall in this category, e.g., the adjectives
“ﺧﻮﺑﺼﻮرت” (khoobsurat, beautiful) and “ﺑﮩﺎدر” (bhadur, brave) both have positive
orientation and are marked with prior polarity = +1, whereas “ﮔﮭﭩﯿﺎ” (ghatya, cheap) has
prior polarity = -1 due to its negative orientation.
Figure 5.8 Structure of the sentiment annotated lexicon (Syed et al 2011 c).
b) Subjective terms with intensity only, T(I): For such terms, the prior polarity is assigned
with respect to the possible intensity values, showing the degree of the polarity, i.e., 1
for absolute, 2 for comparative and 3 for superlative.
Examples: Adjective modifiers are basically terms with intensity only, e.g., both
the modifiers “ﺑﮩﺖ” (bohat, much) and “زﯾﺎده” (zyadah, more) have prior polarity = 2,
and “ﺳﺐ ﺳﮯ زﯾﺎده” (sab say zyadah, most) has prior polarity = 3.
c) Subjective terms with both orientation and intensity values, T(I, O): In this case, the
prior polarity is calculated by multiplying the orientation score (+1 or -1) by the
intensity score (1, 2 or 3).
Examples: In Urdu, very few terms fall in this category. For example, the
words “ﺑﮩﺘﺮ” (behter, better), “ﺑﮩﺘﺮﯾﻦ” (behtareen, best) and “ﺑﺮﺗﺮ” (badtar, worse) have
prior polarities = +2, +3 and -2, respectively. These are usually Persian loan words.
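The three categories of subjective terms can be represented, as a rough sketch, by a single record with optional orientation and intensity fields. The class name LexiconEntry, its fields and the use of transliterated keys are assumptions for illustration and do not reproduce the actual lexicon format; the polarity values are the ones given in the examples above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LexiconEntry:
    term: str                          # transliterated form, used here for readability
    orientation: Optional[int] = None  # +1 or -1; None for intensity-only terms
    intensity: Optional[int] = None    # 1, 2 or 3; None for orientation-only terms

    @property
    def prior_polarity(self) -> int:
        if self.orientation is not None and self.intensity is not None:
            return self.orientation * self.intensity  # T(I, O)
        if self.orientation is not None:
            return self.orientation                   # T(O)
        if self.intensity is not None:
            return self.intensity                     # T(I)
        raise ValueError("objective terms carry no prior polarity")


LEXICON = {entry.term: entry for entry in [
    LexiconEntry("khoobsurat", orientation=+1),
    LexiconEntry("ghatya", orientation=-1),
    LexiconEntry("bohat", intensity=2),
    LexiconEntry("sab say zyadah", intensity=3),
    LexiconEntry("behter", orientation=+1, intensity=2),
    LexiconEntry("behtareen", orientation=+1, intensity=3),
    LexiconEntry("badtar", orientation=-1, intensity=2),
]}
```

Keeping orientation and intensity as separate fields, rather than storing only the product, makes it possible to tell a modifier with intensity 2 apart from a comparative adjective with polarity +2.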
The annotated lexicon of Urdu words is integrated with the sentiment classifier as shown
in Figure 5.9. First of all, the given text, in the form of a review, is taken from the website.
The sentiment classifier component of the system preprocesses this review and segments it
into sentences and then words. These words are then tagged with their respective parts of
speech. Next, the tagged words are compared with the lexicon entries for sentiment
orientations and intensities. This comparison results in polarity-marked, or
polarity-annotated, words and phrases.
[Figure 5.9 diagram. Recoverable labels: Website; Given review in Urdu text; Sentiment Classifier; POS-tagged words/phrases; Sentiment-annotated lexicon; Polarity-annotated words/phrases; Classification of Urdu review.]
Figure 5.9 Integration of the lexicon of Urdu words with the sentiment classifier (Syed et al 2010).
On the basis of the polarities of the individual words, the sentence polarity and then the
total review polarity are calculated. We evaluate the overall system using a corpus of
movie reviews in the Urdu language; the experimentation is given in Chapter 6. The
classification algorithm is applied to the review corpus. Each subjective word in the
review is compared with the lexicon entries for the computation of the polarity scores.
Chapter review:
This chapter has presented our approach in detail, as well as the modules of the system:
PREPROCESSOR, EXTRACTOR, ASSOCIATOR, and CLASSIFIER (see Figure 6.1).
Each module is described by a separate detailed model. In the next chapter, we evaluate
our model through experimentation.
As a pioneering effort, in this research we describe the structure, construction and
evaluation of a manually tagged sentiment-annotated Urdu words based lexicon as a
component of a sentiment analysis model developed for Urdu text. The lexicon contains
Chapter 6| Experimentation and Results 88
CHAPTER 6
67 1,368 1,920
6.2. Corpus
Due to the lack of a publicly accessible corpus of Urdu-language reviews, we collected
two corpora of reviews to evaluate the efficacy of the employed model. The first corpus,
C1, is a collection of 700 movie reviews, of which 385 are positive and 315 are negative.
The average document length in this corpus is 264 words. To obtain varied reviews, 40
different movies with different (already known) popularity scores and categories
(comedy, drama, historical, etc.) were given for review.
The second test-bed, C2, is a corpus of reviews of electronic appliances. This corpus
comprises a total of 650 reviews, 322 positive and 328 negative. The base collection
has reviews of three product types: refrigerators (237), air-conditioners (250) and
televisions (163). The average review length is 196 words. To achieve diversity, 9
different brands of electronic appliances were given for review.
For both corpora, the reviews within the threshold boundary or with neutral scores were
removed. Hence, the data set contains only positive or negative reviews, as shown in
Table 6.2.
Corpus | Total reviews | Average length | Positive | Negative
Movies C1 | 700 | 264 words | 385 | 315
Electronic appliances C2 | 650 | 196 words | 322 | 328
Before proceeding to the results, we consider different case studies from Urdu text
and show how system processes such as POS tagging, extraction, association and
polarity annotation are performed.
ﻟﻮگ ﺳﺎﺗﮭ آﺗﮯ ﮔﮱ اور ﮐﺎرواں ﺑﻨﺘﺎ ﮔﯿﺎ۔ ﻣﯿﮟ اﮐﯿﻼ ﮨﯽ ﭼﻼ ﺗﮭﺎ ﺟﺎﻧﺐ ﻣﻨﺰل ﻣﮕﺮ
(main akela hi chala tha jaanib-e-manzil magar, log saath aate gaye aur kaaravaan bantaa
gayaa, I had started all alone towards the destination, but; people kept joining and it
became a caravan.)
<SC> <ﻣﮕﺮN> < ﻣﻨﺰلADJ> < ﺟﺎﻧﺐTA> < ﺗﮭﺎVB> < ﭼﻼADV> < ﮨﯽADJ> < اﮐﯿﻼPP>ﻣﯿﮟ
In this sentence, both the SentiUnit and the target are complex, i.e., they are composed of
more than one word. The SentiUnit “ﺑﮍا ﺷﺎﻧﺪار” (bara shaandaar, very fabulous) is made
up of an adjective head word and a positive modifier. The target of the comment,
“ارﺗﻀﯽ ﮐﺎ روﺑﻮٹ” (Irtaza ka robot, Irtaza’s robot), consists of three words: two nouns with a
possession marker in between, as shown in Table 6.3.
Remark Parse
Example 2:
“”ارﺗﻀﯽ اورﻓﺎطﻤہ ﮐﺎ ﮐﻤﺮه ﮨﻮادارﻧﮩﯿﮟ
Irtaza aur Fatima ka kamrah hawadar naheen
Irtaza and Fatima’s room is not airy.
Again, both the SentiUnit and the target are complex. The SentiUnit “ﮨﻮادارﻧﮩﯿﮟ” (hawadar
naheen, not airy) contains an adjective head and a negation word. The target of the
comment is even more complex, i.e., “ارﺗﻀﯽ اورﻓﺎطﻤہ ﮐﺎ ﮐﻤﺮه” (Irtaza aur Fatima ka
kamrah, Irtaza and Fatima’s room) is made up of five words: three nouns, a possession
marker and a conjunction. The sentence parse is given in Table 6.4.
Remark | Parse
Sentence with complex SentiUnit and target | [N CJC N PM N] NP [ADJ NEG] SU: ارﺗﻀﯽ اورﻓﺎطﻤہ ﮐﺎ ﮐﻤﺮه ﮨﻮادارﻧﮩﯿﮟ
Noun phrase with conjunction (CJC) and possession marker (PM) | N CJC N PM N = NP (Target): ارﺗﻀﯽ اورﻓﺎطﻤہ ﮐﺎ ﮐﻤﺮه
SentiUnit with negation (NEG) | ADJ NEG = SU (SentiUnit): ﮨﻮادارﻧﮩﯿﮟ
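The parse in the table above can be reproduced by a simple tag-pattern chunker. This is a minimal sketch under the assumption that the input is already POS tagged with the tags N, CJC, PM, ADJ and NEG; the function name chunk and the two patterns it hard-codes are illustrative only and cover just the structures shown in this example.

```python
def chunk(tagged):
    """Group (word, tag) pairs into NP (target) and SU (SentiUnit) chunks."""
    chunks, i, n = [], 0, len(tagged)
    while i < n:
        tag = tagged[i][1]
        if tag == "N":
            j = i + 1
            # Extend the noun phrase across "CJC N" and "PM N" continuations.
            while j + 1 < n and tagged[j][1] in ("CJC", "PM") and tagged[j + 1][1] == "N":
                j += 2
            chunks.append(("NP", [word for word, _ in tagged[i:j]]))
            i = j
        elif tag == "ADJ":
            j = i + 1
            # Attach a directly following negation to the adjective head.
            if j < n and tagged[j][1] == "NEG":
                j += 1
            chunks.append(("SU", [word for word, _ in tagged[i:j]]))
            i = j
        else:
            i += 1
    return chunks


# Transliterated tokens of Example 2 in reading order.
example = [("Irtaza", "N"), ("aur", "CJC"), ("Fatima", "N"), ("ka", "PM"),
           ("kamrah", "N"), ("hawadar", "ADJ"), ("naheen", "NEG")]
result = chunk(example)
```

Here result contains one NP chunk with the five target words and one SU chunk with the adjective and its negation, matching the N CJC N PM N and ADJ NEG patterns of the table.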
Example 3:
Here is a short review from Urdu language based movie review corpus.
“”ﻓﻠﻢ ﮐﯽ ﮐﮩﺎﻧﯽ ﺑﻮرﻧﮓ ﮨﮯ۔ ﮨﯿﺮو ﮐﯽ اداﮐﺎری اور ﺷﮑﻞ اﭼﮭﯽ ﻧﮩﯿﮟ۔ ﻧﺎ ﮨﯽ ﮨﺪاﯾﺖ ﮐﺎری ﻗﺎﺑﻞ ﺳﺘﺎﯾﺶ ﮨﮯ۔
Example 4: Let us take an example execution of a single sentence. Figure 6.1 shows the
execution steps in detail:
Description:
In this sentence, there are three noun phrases. One of them is complex, i.e., “ﮐﺜﺮﺳﻠﻄﺎﻧﯽ”
(kasr-e sultani, king’s palace). In the English translation, an apostrophe is used in place
of “of”. In Urdu, however, no indication is visible, because the diacritic mark is optional
and mostly omitted. Only native Urdu readers can infer the right pronunciation and
meaning. This phenomenon is called compounding, which is very common in Urdu texts
(discussed in Chapter 3).
<NEG>ﻧﮩﯿﮟ
NP 1: <PP>ﺗﯿﺮا
Version 2:
Let us consider another version of the same sentence:
ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﮐﮯ ﮔﻨﺒﺪ ﭘﺮ ﻧﮩﯿﮟ
tera nasheman kasr-e sultani kay gunbad par naheen
Your home is not on the tower of the king’s palace
Description:
In this sentence, only the word order is changed, but the composition of the noun phrases
remains the same.
NP 1: <PP>ﺗﯿﺮا
<NN>ﻧﺸﯿﻤﻦ, NP 1 is based on one noun and one adjective.
NP 2: <NN>ﮐﺜﺮ
<ADJ>ﺳﻠﻄﺎﻧﯽ, NP 2 is called compounding of two nouns through diacritic.
<P>ﮐﮯ
NP 3: <NN>ﮔﻨﺒﺪ, NP 3 is simple with single noun.
<P>ﭘﺮ
<NEG>ﻧﮩﯿﮟ
Version 3:
Another version of the sentence is
ﺗﯿﺮا ﻧﺸﯿﻤﻦ ﮔﻨﺒﺪ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ ﭘﺮ ﻧﮩﯿﮟ
tera nasheman gunbad-e kasr-e sultani par naheen
Your home is not on the tower of the king’s palace
Description:
In this case, the word order is changed and “ﮐﮯ” (kay, of) is replaced by the diacritic, making
“ﮔﻨﺒﺪ” (gunbad, tower) an additional word in the noun phrase, i.e., “ﮔﻨﺒﺪ ﮐﺜﺮﺳﻠﻄﺎﻧﯽ” (gunbad-e
kasr-e sultani, king’s palace’s tower). Therefore, the sentence contains two noun phrases.
NP 1: <PP>ﺗﯿﺮا
<NN>ﻧﺸﯿﻤﻦ, NP 1 is based on one noun and one adjective.
NP 2: <NN>ﮔﻨﺒﺪ
<NN>ﮐﺜﺮ
<ADJ>ﺳﻠﻄﺎﻧﯽ, NP 2 is called compounding of two nouns through diacritic.
<P>ﭘﺮ
<NEG>ﻧﮩﯿﮟ
Example 2:
“”ﻣﯿﺮی ﮐﺘﺎب ﻋﻤﺪه ﻧﮩﯿﮟ ﮨﮯ
meri kitab umdah naheen hay
My book is not good.
Description:
The SentiUnit is made up of an adjective, as the subjective term with orientation only, and
a negation term, as the polarity shifter.
Hence,
SUp = Tp × Neg …… (2)
where Tp = +1 and Neg = -1.
Putting these values in equation 2, we get
SUp = (+1) × (-1) = -1
Example 3:
“”وه ﺳﺐ ﺳﮯ زﯾﺎده ﺳﺨﯽ ﮨﮯ
woh sab say zyadah sakhee hay
He is the most generous of all.
Description:
The SentiUnit is made up of four lexical units: an adjective, as the subjective term with
orientation only, and a superlative modifier made up of three words. The adjective polarity
shifts to the superlative degree due to the intensity of the modifier.
Hence,
SUp = (Tp1) (Tp2)……. (3)
Where
The chunker finds “اﭼﮭﯽ” as a sentiment expression. The ASSOCIATOR module then
searches for the target noun phrase, which is “ﺑﺎﻧﮓ درا”, the name of the book, as shown
in Figure 6.2.
6.4. Results
Accuracy alone is not a sufficient performance metric for evaluating the effectiveness
and efficiency of a text classifier. Therefore, in addition to the accuracy A, we use three
other metrics: the precision P, the recall R and the F-measure F. These metrics
provide much greater insight into the performance of a classifier.
Definition 1: For a sentiment classifier the accuracy A can be defined as the measure of
how close the document classification suggested by the classifier is, to the actual
sentiments present in the review.
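P = tp / (tp + fp)
R = tp / (tp + fn)
F = 2PR / (P + R)
where tp, fp and fn denote the numbers of true positives, false positives and false negatives, respectively.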
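The formulas above translate directly into code. The sketch below is illustrative: the function name and the example counts are hypothetical, and the accuracy is computed with the usual definition (tp + tn) / (tp + fp + fn + tn), which is an assumption here because the formula for A is not written out above. All denominators are assumed to be non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Return (P, R, F, A) from the counts of a binary confusion matrix."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    # Standard accuracy definition (assumed; not spelled out in the text).
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f_measure, accuracy


# Hypothetical counts, chosen only to exercise the formulas.
p, r, f, a = classification_metrics(tp=8, fp=2, fn=2, tn=8)
```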
A series of four experiments, in two sets with two models of the system, has been
performed. Model A is the earlier version of the system with the EXTRACTOR
module only (Syed et al. 2010), and model B is the final version, in which the
ASSOCIATOR module is attached (Syed et al. 2012). This testing allows the efficacy
and usability of the extended version to be compared easily. Both models are applied to
both corpora, C1 and C2, separately.
6.4.1. Model A
Table 6.6 and Table 6.7 show the results of the experiments performed by model A on
both corpora C1 and C2. Table 6.6 shows the detailed results with P, R, F and A values
separately computed for positive as well as negative reviews.
Table 6.6. Experimental results in terms of P, R, F and A for model A (Syed et al. 2012)
Table 6.7 shows a comparative summary of the results from both corpora. The accuracy
on C1 is 70%, and the variation between positive and negative reviews is 8%, whereas the
accuracy on C2 is 78%, and the variation between positive and negative reviews is 2%.
The total accuracy of model A is 74%.
Table 6.7. Comparison of accuracy from both corpora C1 and C2 for model A.
6.4.2. Model B
For the next two experiments, we include the ASSOCIATOR module and test both corpora.
The results are shown in Table 6.8 and Table 6.9. Table 6.8 shows the experimental
results in terms of P, R, F and A for model B applied to C1 and C2, for positive and
negative reviews separately.
Table 6.8. Experimental results in terms of P, R, F and A for model B (Syed et al. 2012).
Results from Table 6.8 are compared and summarized in Table 6.9. The accuracy on C1
improves to 78.5%, and the variation between positive and negative reviews decreases to
3%. Likewise, the accuracy on C2 increases to 86.5%. In this case, however, the variation
between the accuracy of positive and negative reviews increases to 3%. The total accuracy
of model B is 82.5%.
Table 6.9. Comparison of accuracy from both corpora C1 and C2 for model B.
Observations:
From the above results, it is clear that the classification accuracy is highly domain
specific. The movie reviews in C1 are more challenging to classify than those of
electronic appliances in C2. The reason is that movie reviews contain more allegory,
which results in more divergence, not only in syntactic or semantic structure but also in
appraisal type. Discussion of the movie plot and its characters, whether good or evil, is
a very frequent phenomenon. This discussion results in a number of appraisal targets,
which can in turn lead to the selection of the wrong linkage. On the other hand, all
positive or negative comments about the parts of an electronic appliance are indirectly
related to the same target.
Moreover, the classification accuracy also depends upon the orientation of the review.
The results also show that negative reviews are more prone to misclassification than
positive ones.
On the basis of the above discussion, it is clear that negation markers affect the results
of the analyzer to a great extent; therefore, we carry out experimentation to analyze the
behavior of negation. For this purpose, we divide the dataset into three different sets of
data. During the test-bed normalization process, we clean out the neutral comments from
all three sets.
Set 1: Set 1 includes the sentences in which both implicit and explicit negation
are absent. The polarity of these sentences depends only on the subjective terms and other
polarity shifters.
Set 2: Set 2 contains the sentences in which only explicit negation particles are
used and implicit negation is absent.
Set 3: To compile Set 3, we add sentences with implicit negation to Set 2. In this set,
both implicit and explicit negation are present in addition to polar terms.
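The construction of the three sets can be sketched as below. This is a minimal illustration under stated assumptions: tokens are romanized, the explicit particles are the two named in this chapter (naheen, na), and implicit negation is taken from a manual annotation flag, since the text does not describe how it is detected. The `Sentence` class and `build_sets` function are hypothetical names.

```python
from dataclasses import dataclass

# Romanized explicit negation particles named in this chapter.
EXPLICIT_PARTICLES = {"naheen", "na"}


@dataclass
class Sentence:
    tokens: list[str]        # romanized tokens, for illustration only
    implicit_negation: bool  # assumed to come from manual annotation


def has_explicit_negation(s: Sentence) -> bool:
    return any(tok in EXPLICIT_PARTICLES for tok in s.tokens)


def build_sets(sentences: list[Sentence]) -> dict[str, list[Sentence]]:
    # Set 1: neither explicit nor implicit negation.
    set1 = [s for s in sentences
            if not has_explicit_negation(s) and not s.implicit_negation]
    # Set 2: explicit negation particles only.
    set2 = [s for s in sentences
            if has_explicit_negation(s) and not s.implicit_negation]
    # Set 3: Set 2 plus every sentence that carries implicit negation.
    set3 = set2 + [s for s in sentences if s.implicit_negation]
    return {"set1": set1, "set2": set2, "set3": set3}
```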
Table 6.10 gives the results for the three sets of data in terms of precision, recall,
and F-measure. From these values, the total performance accuracy is about 77%. Set 1,
in which only polar terms are present, gives the best classification results.
The results from Set 2 are lower, as it contains only the
sentences with negation particles. From this result, it is inferred that negation
particles can cause a relatively high rate of misclassification. Nevertheless, the average
accuracy over Set 1 and Set 2 is quite satisfactory. The results from Set 3 show that
implicit negation still needs improved treatment.
Observations:
Apart from the results, the following observations about negation are worth
mentioning:
On average, two to three negation particles appear in a single review, and the use of
negation is author dependent; some authors tend to use more negation particles than
others. “نہیں” (naheen, not) is the most used particle. In comparative sentences, the
negation particle “نا” (na, no) is used with multiple targets of the appraisal.
Sentential negation is rarely misclassified as compared to constituent negation.
Morphological negation is handled automatically, because most of the words
inflected by the lexical negation marks,
e.g., “بے” (bay), “با” (ba), etc., are already present in the lexicon and are annotated
with their respective polarities;
e.g., “بے فایدہ” (bayfayeeda, useless) is a lexical entry with a negative polarity.
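The last observation amounts to a whole-word lexicon lookup, sketched below. The dictionary is a toy: only the entry for bayfayeeda (useless, negative) comes from the text, while the second entry, the romanized keys, and the function name are assumptions made for illustration.

```python
from typing import Optional

# Toy sentiment lexicon keyed by romanized word form: +1 positive, -1 negative.
LEXICON = {
    "bayfayeeda": -1,  # "bay" + "fayeeda", stored as one negative entry
    "fayeeda": +1,     # hypothetical entry for the unprefixed word
}


def lexical_polarity(word: str) -> Optional[int]:
    """Return the annotated polarity, or None for an out-of-lexicon word.

    No prefix stripping happens here: a word carrying a negation prefix is
    looked up as a whole, so its polarity is whatever the lexicon records.
    """
    return LEXICON.get(word)


print(lexical_polarity("bayfayeeda"))  # -1
```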
پچھلے مہینے میں نے ایک لیپ ٹاپ خریدا ہے۔ یہ ایک شاندار چیز ہے۔ اس کی پروسیسنگ کی
رفتار بہت حیرت انگیز ہے۔ آپریٹنگ سسٹم بہترین ہے۔ اگرچہ بیٹری دیرپا نہیں ہے، جو میرے لئے مفید ہے۔
Translation: Last month I bought a laptop. It is a wonderful thing. Its processing speed is
very amazing. The operating system is the best. However, its battery is not long lasting.
But this is good for me.
Result:
This is a positive review, as the result of the analysis in Figure 6.4 shows.
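How the sentence-level polarities of the translated laptop review could add up to a positive label is sketched below. This is an illustration only: the per-sentence polarities are assigned by hand, and the simple sum is an assumed aggregation rule, not the scoring actually used by the analyzer.

```python
def classify_review(sentence_polarities: list[int]) -> str:
    """Label a review by the sign of the summed sentence polarities."""
    total = sum(sentence_polarities)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


# Hand-assigned polarities for the six translated sentences (assumed values).
laptop_review = [
    0,   # bought a laptop: no subjective term
    +1,  # a wonderful thing
    +1,  # processing speed is amazing
    +1,  # operating system is the best
    -1,  # battery is not long lasting: positive term, negated
    +1,  # good for me
]
print(classify_review(laptop_review))  # positive
```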
Translation: After a long time, I got a film to watch. The film’s topic is old. This film is
utter rubbish and intolerable. The hero’s acting is the best in the movie. It was no fun to
watch the movie.
Result:
CHAPTER 7
Conclusions and Future Directions
Urdu adjectival phrases are morphologically complex. In Section 4.1, we discussed
both marked and unmarked adjectives, which are borrowed from many languages, such as
Persian, Arabic, Hindi, Sanskrit, and English. This diversity results in flexibility and
variety in the morphological and grammatical rules. For example, adjectives that
are Persian loans follow Persian grammar and usually remain unmarked, whereas
Sanskrit-based adjectives show inflections for gender, number, etc.
Almost all types of adjectives (descriptive, attributive, predicative, demonstrative, etc.)
show agreement in case, gender, and number with the noun they qualify.
Similarly, some other linguistic phenomena are specific to the Urdu language, e.g., frequent
reduplication (partial as well as full), compounding, and frequent inflections and derivations.
Moreover, the above-mentioned linguistic aspects of the Urdu language result in much
more complex lexicons. The out-of-vocabulary rate is much higher than for languages
with well-defined grammars. These aspects also lead to poor or unreliable language model
probability estimation, because many combinations of word forms are missing from, or
rarely available in, the language model training data.
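The out-of-vocabulary rate mentioned above can be computed as the share of test tokens never seen in the training data. The sketch below assumes whitespace-tokenized, romanized input and treats every inflected surface form as a distinct word; the function name and the toy tokens are illustrative.

```python
def oov_rate(train_tokens: list[str], test_tokens: list[str]) -> float:
    """Fraction of test tokens whose surface form never occurs in training."""
    vocabulary = set(train_tokens)
    if not test_tokens:
        return 0.0
    unseen = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unseen / len(test_tokens)


# Toy data: "acha" and "achi" are inflected forms of the same adjective,
# but a surface-form vocabulary counts them as different words.
train = ["acha", "larka"]
test = ["achi", "larki", "acha"]
print(oov_rate(train, test))
```

In this toy case two of the three test tokens are unseen, which is how rich inflection inflates the rate even when the underlying stems are known.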
It is observed that the domain of the test bed affects the classification accuracy: the
results for one domain differ from those for another. Moreover, the orientation of the text
being analysed affects the accuracy to a great extent. Negative reviews are more prone
to misclassification than positive ones.
For this reason, our approach handles phrase-level negation as part of the SentiUnits,
which contain adjectives as the core terms and include the negation particles as their
logical constituents. Hence, the total effect of the negation is handled together with the
effect of the subjective words. This approach is well suited to the free word order
of the Urdu language. It also handles the variant grammatical structures of
Urdu sentences successfully, as indicated by the experimental results, with an
overall accuracy of 77%.
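The idea of treating negation particles as constituents of a SentiUnit can be sketched as follows. The class is a deliberately simplified, hypothetical stand-in for the SentiUnit structure: it holds one adjective, its lexicon polarity, and the particles attached to it. The rule that each particle flips the polarity once, and the romanized word forms, are assumptions of this sketch.

```python
from dataclasses import dataclass, field

NEGATION_PARTICLES = {"naheen", "na"}  # romanized particles named in Chapter 6


@dataclass
class SentiUnit:
    """Simplified SentiUnit: an adjective plus the particles attached to it."""
    adjective: str
    adjective_polarity: int  # +1 or -1, taken from a lexicon
    constituents: list[str] = field(default_factory=list)

    def polarity(self) -> int:
        # Each negation particle inside the unit flips the adjective's
        # polarity, wherever the particle stood in the sentence.
        flips = sum(1 for c in self.constituents if c in NEGATION_PARTICLES)
        return self.adjective_polarity * (-1) ** flips


# "derpa naheen" (not long lasting): a positive adjective, negated.
print(SentiUnit("derpa", +1, ["naheen"]).polarity())  # -1
```

Because the particle is stored inside the unit rather than located by position, the result does not depend on where the particle appears in the sentence, which is the property the free word order calls for.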
Although the shallow-parsing-based approach is appropriate for handling simple
opinions, it results in misclassifications when applied to complex sentences with
multiple targets. Therefore, the approach presented in Model B, which uses dependency
parsing after the shallow parsing, is much more reliable.
languages, like Punjabi, Persian, Sindhi, etc., which have the same orthography and very
similar grammar rules.
Most of the research works presented for the English language rely only on the extraction of
adjectives or adjectival phrases. Very few contributions have
considered adverbs or adverbial phrases. In future, we intend to extend our model by
adding adverbial phrases in combination with adjectival phrases to handle more
diversified opinions. In this way both aspects of the target product, i.e., its functions and
its attributes, can be handled. The main strength of this model is its flexibility. As we have
considered the classification at the phrase level, we can add new rules and new phrases
to the core model very easily, without making major alterations to the algorithm.
REFERENCES
35. Kennedy A, Inkpen D (2006) Sentiment classification of movie and product reviews
using contextual valence shifters. Computational Intelligence 22(2):110–125
36. Kennedy A, Inkpen D (2005) Sentiment classification of movie reviews using
contextual valence shifters. In: Proceedings of FINEXIN
37. Khan SA, Anwar W, Bajwa UI (2011) Challenges in developing a rule based Urdu
stemmer. In: Proceedings of 2nd workshop on South and Southeast Asian natural
language processing, pp 46–51
38. Kim S-M, Hovy E (2006) Automatic identification of pro and con reasons in online
reviews. In: Proceedings of the COLING, Sydney pp 483–490
39. Kumar A, Siddiqui T (2008) An Unsupervised Hindi Stemmer with Heuristics
Improvements. In: Proceedings of the Second Workshop on Analytics for Noisy
Unstructured Text Data.
40. Lehal GS (2009) A two stage word segmentation system for handling space insertion
problem in Urdu script. In: Proceedings of world academy of science, engineering and
technology, Bangkok pp 321–324
41. Lehal GS (2010) A word segmentation system for handling space omission problem in
Urdu script. In: Proceedings of the 1st workshop on South and Southeast Asian natural
language processing (WSSANLP), the 23rd international conference on computational
linguistics, COLING, Beijing, pp 43–50
42. Moilanen K, Pulman S (2008) The good, the bad, and the unknown. In: Proceedings of
ACL/HLT
43. Muaz A, Ali A, Hussain S (2009) Analysis and development of Urdu POS tagged
corpora. In: Proceedings of the 7th workshop on Asian language resources, ACL-
IJCNLP, Suntec, Singapore, pp 24–31
44. Mukund S, Ghosh D (2011) Using sequence kernels to identify opinion entities in Urdu.
In: Proceedings of the 15th conference on Computational Natural Language Learning, pp
58-67.
45. Mukund S, Ghosh D, Srihari RK (2010) Using cross-lingual projections to generate
semantic role labeled corpus for Urdu—a resource poor language. In: Proceeding of the
23rd international conference on computational linguistics COLING, Beijing pp 797–805
57. Riloff E, Wiebe J, Wilson T (2003) Learning subjective nouns using extraction pattern
bootstrapping. In: Proceedings of the 7th conference on natural language learning,
Edmonton, pp 25–32
58. Rizvi SMJ, Hussain M (2005) Modeling case marking systems of Urdu-Hindi languages
by using semantic information. In: Proceedings of natural language processing and
knowledge engineering, pp 85–90
59. Schmidt RL (1999) Urdu: an essential grammar. Routledge Publishing, New York
60. Sharifloo AA, Shamsfard M (2008) A Bottom up Approach to Persian Stemming. In:
Proceedings of the 3rd International Joint Conference on Natural Language Processing.
61. Singh A, Bendre S, Sangal R (2005) HMM based chunker for Hindi. In: Proceedings of
IJCNLP-05: 2nd international joint conference on Natural Language Processing.
62. Snyder B, Barzilay R (2007) Multiple aspect ranking using the Good Grief algorithm. In:
Proceedings of the joint human language technology/North American chapter of the ACL
conference, Rochester, NY pp 300–307
63. Stone PJ, Dunphy DC, Smith MS, Ogilvie DM (1966) The general inquirer: a computer
approach to content analysis. MIT Press, Cambridge
64. Syed AZ, Muhammad A, Martínez-Enríquez AM (2012) Associating Targets with
SentiUnits: A Step Forward in Sentiment Analysis of Urdu Text. In: Artificial
Intelligence Review.
65. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (a) Sentiment Analysis of Urdu
Language: Handling Phrase-Level Negation. In: Proceedings of the 10th Mexican
international conference of artificial intelligence, pp 382–393
66. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (b) Adjectival Phrases as the
Sentiment Carriers in the Urdu Text. Journal of American Science 7(3):644–652
67. Syed AZ, Muhammad A, Martínez-Enríquez AM (2011) (c) Sentiment-Annotated
Lexicon Construction for an Urdu Text Based Sentiment Analyzer. In: Pakistan Journal
of Science, ISSN: 0030-9877
68. Syed AZ, Muhammad A, Martínez-Enríquez AM (2010) Lexicon based sentiment
analysis of Urdu text using SentiUnits. In: Proceedings of the 9th Mexican international
conference of artificial intelligence, Pachuca, Mexico, pp 32–43
69. Tan S, Cheng X, Wang Y, Xu H (2009) Adapting Naive Bayes to domain adaptation for
sentiment analysis. In: Proceedings of the 31st European conference on IR research on
advances in information retrieval, pp 337–349
70. Thabet N (2004) Stemming the Qur’an. In: Proceedings of the Workshop on
Computational Approaches to Arabic Script-based Languages.
71. Tsarfaty R, Seddah D, Goldberg Y, Kübler S, Candito M, Foster J, Versley Y, Rehbein I,
Tounsi L (2010) Statistical parsing of morphologically rich languages (SPMRL) what,
how and whither. In: Proceedings of the NAACL HLT 2010 first workshop on statistical
parsing of morphologically-rich languages, Los Angeles, pp 1–12
72. Turney P (2002) Thumbs up or thumbs down? Semantic orientation applied to
unsupervised classification of reviews. In: Proceedings of 40th meeting of the association
for computational linguistics, Philadelphia, PA, pp 417–424
73. Turney P, Littman M (2003) Measuring praise and criticism: inference of semantic
orientation from association. ACM Trans Inf Syst 21(4):315–346
74. Whitelaw C, Garg N, Argamon S (2005) Using appraisal groups for sentiment analysis.
In: Proceedings of ACM SIGIR conference on information and knowledge management
(CIKM 2005), Bremen, pp 625–631
75. Wiebe J, Wilson T, Bruce R, Bell M, Martin M (2004) Learning subjective language.
Comput Linguist 30(3):277–308
76. Wiegand M, et al. (2010) A survey on the role of negation in sentiment analysis. In:
Proceedings of the workshop on negation and speculation in natural language
processing. Association for Computational Linguistics
77. Wilson T, Wiebe J, Hoffmann P (2005) Recognizing contextual polarity in phrase-level
sentiment analysis. In: Proceedings of HLT/EMNLP
78. Yang K, Yu N, Valerio A, Zhang H (2006) WIDIT in TREC 2006 Blog Track. In:
Proceedings of Text REtrieval conference—TREC
79. Yu H, Hatzivassiloglou V (2003) Towards answering opinion questions: separating facts
from opinions and identifying the polarity of opinion sentences. In: Proceedings of
EMNLP’03, pp 129–136