
Abstract—Keyphrase extraction is an essential part of the contemporary world of the internet. Researchers have treated keyword extraction as a classification problem in which the input candidate words are classified as keywords or non-keywords. We propose a supervised machine learning method for keyword extraction using an SVM. Noun phrases extracted using a specified regular expression are considered the candidate words. The classifier uses a combination of statistical and linguistic features, such as part of speech, frequency, and spread, to classify the extracted candidate words. The proposed method shows promising results in the classification task.

Keywords—Keyphrase extraction, Noun phrases, Feature extraction, SVM, Confusion matrix

I. INTRODUCTION

In today's era, a tremendous amount of information is available on the web. Thousands of books, journals, articles, and papers are published daily about any single topic. When studying a topic, it is nearly impossible to read the complete documents. Therefore, an efficient information retrieval method is required to extract relevant information, which generally involves extracting the index terms or keywords from the documents. Keywords are the smallest units that play a significant role in determining the main context of a given document. They can be used to easily identify the meaning of the entire document, which can be exploited by various applications such as text summarizers, topic detectors, and catalogers.

Automatic keyphrase extraction is defined as "the automatic selection of important and topical phrases from the body of a document" [1]. The main goal of keyword extraction is to extract a set of phrases that help in grasping the central idea of the document without the need to read it. Readers of all kinds of articles, whether academic, sports, business, or social, benefit greatly from automatically generated keywords. Keywords of a document are also of great importance to search engines, as they help in providing precise results for queries. Given this importance, the topic of automatic keyword extraction has received great attention in recent years. The many automated keyword extraction systems that have been introduced are based on the following three approaches:

• Statistical approach: Statistical information about the words, such as word frequency, word co-occurrence, and term nesting statistics, is used to identify the keywords. This approach does not require any training data. The specified features are extracted for the document and, based on the obtained values, the keywords are listed.

• Linguistic approach: This approach uses parts of speech, such as nouns and adjectives, in sentence semantics. Noun phrases are extracted, and further methods are then used to weigh them as keyword candidates.

• Machine learning based approach: A more sophisticated approach than the previous two, it trains a classifier on a dataset containing documents along with their extracted keywords. This classifier is then used to predict the keywords of a document. Techniques that can be used include Support Vector Machines (SVM) and Naive Bayes.

Keyword extraction methods can also be classified as extractive or abstractive. Extractive methods directly extract keywords from the text, whereas abstractive methods work similarly to a human and generate keywords using contextual knowledge of the text. The keywords generated by abstractive methods may not be directly present in the text and may be a combination of some of its words.

The proposed ML method uses an extractive approach to find and then classify the candidate words.

II. LITERATURE SURVEY

Many supervised and unsupervised extractive approaches have been proposed to date.

The Kea system [2] first chooses the candidate phrases (of maximum length three) by cleaning the input text and then identifying candidates by removing stopwords and excluding proper nouns. Next, stemming is applied to obtain the final list of candidates. Key phrases are then selected on the basis of two features, namely TF-IDF (Term Frequency - Inverse Document Frequency) and first occurrence. Discretization is applied to convert the features into nominal data that can be consumed by machine learning classifiers. Lastly, a Naive Bayes classifier is trained on the basis of these features.
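For reference, the TF-IDF weight referred to here has the standard form below (Kea discretizes this quantity before feeding it to the classifier; see [2] for the exact variant), where f(t, d) is the number of occurrences of term t in document d, |d| is the length of d, N is the number of documents in the corpus, and df(t) is the number of documents containing t:

```latex
\mathrm{tfidf}(t, d) = \frac{f(t, d)}{|d|} \times \log_2 \frac{N}{\mathrm{df}(t)}
```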

The keyphrase extractor suggested by Turney [1] works on a similar approach but uses candidate phrases of maximum length five. The features used for weighing the candidate keyphrases include the frequency, relative length, and first occurrence of the phrase, the number of words in the phrase, etc. A decision tree is used to detect the keyphrases.

Krulwich and Burkey [3] created InformationFinder through "the extraction of semantically significant phrases". What makes it innovative is the use of heuristics to extract important phrases for learning, rather than performing complex mathematical calculations on them, and, more importantly, the creation of decision trees to determine the user's interest. Finally, the decision tree is transformed into a "boolean search query string".

Mihalcea and Tarau (2004) [4] proposed an unsupervised approach that finds keywords using a TextRank model defined by the distance between co-occurrences of the same word. The candidate words are limited to certain lexical units based on part of speech. The vertices, i.e. the candidate words, are then ranked, and the top T are returned as the extracted keywords.

Nguyen and Kan (2007) [5] proposed keyphrase extraction for scientific articles using morphological features found in such articles.

Wan and Xiao (2008) [6] proposed a single-document keyphrase extraction method using neighborhood knowledge, whereby an expanded document is created by adding neighbor documents to the document of interest, and keywords are then ranked using a graph that utilizes both the local and the global information in the neighbor documents.

Kathait and Tiwari (2017) [7] proposed an unsupervised approach based on the extraction of noun phrases. The candidate words contain only adjectives and nouns. Firstly, individual words are scored using a scoring scheme such as TF-IDF; secondly, the n-grams or phrases are given scores equal to the sum of the scores of the individual words present in them (see the formula below). The top-ranked phrases are considered the output keyphrases.
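Written out, the phrase scoring in [7] reduces to a plain sum over the words w of a candidate phrase p, with score(w) being the individual word score (TF-IDF in the simplest case):

```latex
\mathrm{score}(p) = \sum_{w \in p} \mathrm{score}(w)
```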
III. PROPOSED WORK

The proposed supervised keyphrase extraction scheme uses the DUC-2001 benchmark dataset, which contains around 300 news articles [6]. The methodology consists of four main tasks: (1) candidate selection, (2) feature extraction, (3) training the SVM, and (4) prediction of keyphrases.

Fig. 1 The training and keyword prediction process

A. Candidate Selection

A set of words and phrases is extracted as candidate keyphrases using the regular expression [8]:

{(<Adjective>* <Noun>+ <Preposition>)? <Adjective>* <Noun>+}

Phrases following this pattern are noun phrases. Noun phrases are selected because nouns carry a tremendous amount of information related to the document. Noun phrases can be better understood by knowing the most frequently occurring patterns in them, i.e. Noun, Adjective Noun, Noun Noun, and Noun Preposition Noun. All these phrases essentially contain a noun as their core component and are hence termed noun phrases. A sketch of this step is shown below.
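The following is a minimal sketch of this candidate-selection step, assuming NLTK (with its punkt and averaged_perceptron_tagger resources installed) rather than the authors' own implementation. The chunk grammar transliterates the pattern above into Penn Treebank tags (JJ for adjectives, NN.* for nouns, IN for prepositions), with the optional prepositional part expressed as a separate, longer rule that is tried first.

```python
import nltk

# Chunk grammar mirroring {(<Adjective>* <Noun>+ <Preposition>)? <Adjective>* <Noun>+}
GRAMMAR = r"""
NP: {<JJ>*<NN.*>+<IN><JJ>*<NN.*>+}
    {<JJ>*<NN.*>+}
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_candidates(text):
    """Return the noun phrases in `text` that match the candidate pattern."""
    candidates = []
    for sentence in nltk.sent_tokenize(text):
        tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
        tree = chunker.parse(tagged)
        for subtree in tree.subtrees(filter=lambda t: t.label() == "NP"):
            candidates.append(" ".join(word for word, _ in subtree.leaves()))
    return candidates

print(extract_candidates("Noun phrases carry most of the information of a document."))
```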
B. Feature Extraction

There are five features that we used to represent the candidate keyphrases, which are explained in detail below:

i. POS sequences: According to Hulth and Megyesi (2006) [9], keyphrases tend to have a distinctive distribution of part-of-speech sequences, and hence these are worth using as a feature. This is done using POS tagging, which identifies each individual word in a candidate phrase as a common noun, proper noun, singular noun, plural noun, adjective, preposition, etc. These POS tags of the candidate then form a part of the feature list.

ii. Position of keyphrase: The beginning [10] and concluding parts of a document typically contain information more relevant to the topic being addressed, so we use line and parabolic position as features to rank the words, assigning more weight to candidate words at the beginning and end of a document. A sketch of both weighting schemes follows this list.

• Line position: Line position ranks the candidate phrases such that words at the beginning of the document have the maximum weight, which decreases linearly, is least for a candidate word at the middle of the document, and increases linearly thereafter, so that a candidate word at the end of the document again has the maximum weight.

• Parabolic position: Parabolic position ranks or weighs the candidate words using a similar approach, with the difference that the weight of the candidate words decreases and increases according to the equation of a parabola instead of a line.
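A minimal sketch of the two weighting schemes, assuming positions are normalized to [0, 1] with 0 at the start of the document and 1 at the end; the paper does not specify the exact scaling, so the functions below are illustrative.

```python
def line_position_weight(pos: float) -> float:
    """V-shaped weight: maximal (1.0) at the start and end of the document,
    minimal (0.0) at the middle, varying linearly in between."""
    return abs(2.0 * pos - 1.0)

def parabolic_position_weight(pos: float) -> float:
    """Same shape, but following a parabola with its vertex at the middle."""
    return (2.0 * pos - 1.0) ** 2

for pos in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"{pos:.2f}  line={line_position_weight(pos):.2f}  "
          f"parabola={parabolic_position_weight(pos):.2f}")
```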
iii. Spread of keyphrase in the document: The spread of a word in the document has been found to be a beneficial feature in the closely related task of link generation [11]. A more important keyphrase has a higher standard deviation of its positions, which implies that it is widely spread across the text document, and this thus proves to be an essential feature. Hence, the spread, calculated as the standard deviation of the positions of the candidate in the document, forms a part of the feature list.

iv. Frequency: A more important word tends to occur repeatedly and as a result has a higher frequency in the document [10]. Hence, we consider frequency as a feature in our proposed work.

v. Occurrences of the candidate in the document: According to Louis and Gagnon (2013) [12], position-based features such as first and last occurrence have been found effective in keyword extraction. We used the normalized position of the first occurrence of the candidate and then extended this approach to use the normalized positions of all occurrences of the candidate in the document as part of the feature vector.

Values for these features are extracted for the selected candidate words. All these features are then stored as key-value pairs for each candidate word, as sketched below.
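A minimal sketch of how the statistical features above might be assembled for a single candidate; the function and field names are illustrative rather than taken from the paper, and `positions` is assumed to hold the normalized position of every occurrence of the candidate.

```python
from statistics import pstdev

def statistical_features(positions):
    """Build the key-value feature pairs for one candidate, given the
    normalized positions (each in [0, 1]) of all of its occurrences."""
    return {
        "frequency": len(positions),          # iv. frequency
        "first_occurrence": min(positions),   # v. first occurrence
        "occurrences": sorted(positions),     # v. all occurrences
        "spread": pstdev(positions),          # iii. std. dev. of positions
    }

# A candidate appearing 5%, 40% and 90% of the way through the document:
print(statistical_features([0.05, 0.40, 0.90]))
```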
C. Training the SVM Classifier

The DUC-2001 dataset is split into training and testing data. The above-mentioned features are extracted for the candidate words obtained using the proposed regular expression. During training, the SVM classifier is provided with these features along with a label of 1 or 0 indicating whether the candidate is a keyword or a non-keyword, respectively. The ratio of the sizes of the keyword and non-keyword lists is adjusted to achieve the highest possible value of F-measure.

D. Prediction of Keyphrases

After training the SVM classifier, we test it by predicting the keyphrases from the testing data. The features of the candidate words are again extracted and input to the trained model, which classifies them as keywords or non-keywords. The resulting values of precision, recall, and F-measure are recorded. A sketch of these two steps is shown below.
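A minimal sketch of the training and prediction steps using scikit-learn, assuming X holds the five-dimensional feature vectors and y the 1/0 keyword labels; the 60/40 split follows the paper, but the synthetic data and the SVM hyperparameters (the paper does not state a kernel) are placeholders.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_recall_fscore_support

rng = np.random.default_rng(0)
X = rng.random((1000, 5))          # placeholder five-feature vectors
y = rng.integers(0, 2, size=1000)  # placeholder labels: 1 = keyword, 0 = non-keyword

# 60/40 train/test split, as in the paper.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.6, random_state=0)

clf = SVC()                        # default RBF kernel; an assumption
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_test, y_pred, average="binary")
print(f"precision={precision:.3f} recall={recall:.3f} f-measure={f_measure:.3f}")
```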
IV. OBSERVATIONS

The available dataset was split such that about 60 percent of the documents were used for training the classifier and the remaining 40 percent were used for testing it. We plot the confusion matrix during testing of the classifier and then describe the performance of our system in terms of the three standard metrics, i.e. precision, recall, and F-measure, calculated from this confusion matrix. Initially, there was a large imbalance between the two classes (one being the keyword class and the other the non-keyword class) in the training data. Training the classifier using this data yields a high value of precision but a low recall, whereas completely balancing the two classes yields a very high recall but a very low value of precision. Hence, we trained our classifier by restricting the number of samples in the non-keyword class to an optimal value such that we get the highest possible F-measure on the testing data. We first give a brief description of the confusion matrix, precision, recall, and F-measure, and then report the respective values achieved.

• Confusion matrix: A table that is used to determine the performance of a classifier on a set of data with known true values. The three standard performance metrics below are based on this confusion matrix.

• Precision: "Precision (also called positive predictive value) is the fraction of relevant instances among the retrieved instances."

• Recall: "In pattern classification, recall (also known as sensitivity) is the fraction of relevant instances that have been retrieved over the total amount of relevant instances."

• F-measure: The F-measure is calculated as the harmonic mean of precision and recall; the formulas follow this list.
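In terms of the confusion-matrix counts (TP true positives, FP false positives, FN false negatives), these standard definitions read:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```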
The values of precision, recall, and F-measure were recorded for the testing dataset when the classifier was trained with different class balances.

A. Completely Balanced Dataset

This dataset has around 1500 samples of each class for training.

Precision   Recall   F-measure
12.93       82.04    22.33

Table 1: Observations for the completely balanced dataset

We noted that recall was very high but precision was very low, resulting in a low F-measure of 22.33%. This is because the classifier predicted a large number of false positives, i.e. a large number of wrongly predicted keywords. The classifier labels a lot of candidate words as keywords, many of which are not actually keywords; nonetheless, the actual keywords are correctly classified.
B. Unbalanced Dataset

Here, we did not impose any restriction on the size of the classes in the training dataset, and there was a large imbalance between the sizes of the two classes.

Precision   Recall   F-measure
55.07       3.877    7.24

Table 2: Observations for the unbalanced dataset

In this case, precision was very high but recall became very low, which again resulted in a very low F-measure of only 7.24%. The classifier is over-particular and labels few candidate words as keywords; it therefore misses a lot of keywords.
C. Optimally Balanced Dataset

We balanced the dataset so as to obtain the highest possible value of F-measure. This was achieved when there were around 1500 samples in the keyword class and 8400 samples in the non-keyword class.

Precision   Recall   F-measure
30.8        37.75    33.91

Table 3: Observations for the optimally balanced dataset

Here, it is seen that the F-measure is considerably higher.

V. EVALUATION

To evaluate the effectiveness of our proposed system, we compared it with the previously achieved best scores on the DUC-2001 dataset. The highest precision, recall, and F-measure scores reported by Hasan and Ng [13] are:

Precision   Recall   F-measure
28.8        35.4     31.7

Table 4: The highest previously achieved values on the DUC-2001 dataset [13]

Our achieved scores of 30.8, 37.75, and 33.91, respectively, are considerably higher.

VI. CONCLUSION

We have presented a supervised approach for keyword extraction. Noun phrases are selected as candidate words because the majority of the information in a document is contained in them. The paper also introduces five major features that affect the performance of the SVM classifier. The main finding is that the ratio of samples in the two classes of the training set greatly affects precision and recall. Since the F-measure is the harmonic mean of precision and recall, it is considered the ultimate measure of performance. The optimal balance between the two classes yields a promisingly high value of F-measure. We also concluded that part of speech was the most essential feature, since it increased the F-measure to a great extent.

Although the attained values of F-measure are encouraging, there is scope for improvement by incorporating more features that increase them. The proposed model can also be extended to include more parts of speech, such as adverbs, in the regular expression used to extract candidate keyphrases.
