contemporary world of the internet. Researchers have treated keyword extraction as a classification problem in which the candidate words are classified as keywords or non-keywords. We propose a supervised machine learning method for keyword extraction using SVM. Noun phrases extracted using a specified regular expression are considered as candidate words. The classifier uses a combination of statistical and linguistic features, such as part of speech, frequency, and spread, to classify the extracted candidate words. The proposed method shows promising results in the classification task.

Keywords— Keyphrase extraction, Noun phrases, Feature extraction, SVM, Confusion matrix

I. INTRODUCTION

In today's era, tremendous information is available on the web. Thousands of books, journals, articles, papers, etc. are published on a daily basis about a single topic. When studying any topic, it is nearly impossible to read the complete documents. Therefore, an efficient information retrieval method is required to extract relevant information, which generally involves extracting the index terms or keywords from the documents. Keywords are the smallest units that play a significant role in determining the main context of a given document. They can be used to easily identify the meaning of the entire document, which can be exploited by various applications such as text summarizers, topic detectors, catalogers, etc.

Automatic keyphrase extraction is defined as "the automatic selection of important and topical phrases from the body of a document" [1]. The main goal of keyword extraction is to extract a set of phrases that help in grasping the central idea of the document without the need to read it. Readers of all kinds of articles, whether academic, sports, business, or social, benefit greatly from automatically generated keywords. Keywords of a document are also of great importance to search engines, as they help in providing precise results for queries. Owing to this importance, the topic of automatic keyword extraction has received great attention in recent years. Many automated keyword extraction systems have been introduced, based on the following three approaches:

Statistical approach: Statistics of words, such as word frequency, word co-occurrence, term nesting, etc., are used to identify the keywords. This approach does not require any training data. The specified features are extracted for the document and, based on the obtained values, the keywords are listed.

Linguistic approach: This approach includes the usage of parts of speech, such as nouns and adjectives, in sentence semantics. Noun phrases are extracted and then further methods are used to weigh them to be identified as keywords.

Machine Learning Based Approach: A more sophisticated approach than the previous two, which uses the concept of training a classifier on a dataset containing documents along with their extracted keywords. This classifier is then used to predict the keywords of a document. Various techniques that can be used are Support Vector Machine (SVM), Naive Bayes, etc.

Also, keyword extraction methods can be classified as extractive or abstractive. Extractive methods directly extract keywords from the text, whereas abstractive methods work similarly to a human and generate keywords using contextual knowledge of the text. Keywords generated by abstractive methods may not be directly present in the text and may be a combination of some of its words.

The proposed ML method uses an extractive approach to find and then classify the candidate words.

II. LITERATURE SURVEY

Many supervised and unsupervised extractive approaches have been proposed to date.

The Kea system [2] first chooses candidate phrases (maximum length three) by cleaning the input text and then identifying candidates by removing stopwords and excluding proper nouns. Next, stemming is applied to obtain the final list of candidates. Second, selection of keyphrases is done on the basis of two features, namely TF-IDF (Term Frequency - Inverse Document Frequency) and first occurrence. Discretization is then applied to convert the features into nominal data that can be consumed by machine learning classifiers. Lastly, a Naive Bayes classifier is trained on the basis of these features.
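The two Kea features can be sketched in plain Python as follows (a minimal illustration under assumed tokenization, not Kea's actual implementation; the toy corpus is hypothetical):

```python
import math

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF: high for terms frequent in this document but rare in the corpus."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for d in corpus if term in d)      # number of documents containing the term
    idf = math.log(len(corpus) / (1 + df))
    return tf * idf

def first_occurrence(term, doc_tokens):
    """Normalized position of the first appearance (0 = start of document)."""
    return doc_tokens.index(term) / len(doc_tokens)

# Toy corpus: each document is a list of tokens.
corpus = [
    ["keyword", "extraction", "uses", "noun", "phrases"],
    ["search", "engines", "index", "documents"],
    ["noun", "phrases", "carry", "topical", "information"],
]
doc = corpus[0]
score = tf_idf("keyword", doc, corpus)
pos = first_occurrence("keyword", doc)   # → 0.0, "keyword" opens the document
```

Kea then discretizes such continuous values into nominal bins before handing them to Naive Bayes.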
The keyphrase extractor that Turney [1] suggested works on a similar approach but uses candidate phrases of maximum length five. The features used for weighing the candidate keyphrases are frequency, relative length, first occurrence of the phrase, number of words in the phrase, etc. A decision tree is used to detect the keyphrases.

Krulwich and Burkey [3] created InformationFinder through "the extraction of semantically significant phrases". The tasks that make it innovative are using heuristics to extract important phrases for learning, rather than performing complex mathematical calculations on them, and, more importantly, creating decision trees to determine the user's interest. Finally, the decision tree is transformed into a "boolean search query string".

Mihalcea and Tarau (2004) [4] proposed an unsupervised approach to find keywords using a TextRank model defined by the distance between co-occurrences of the same word. The candidate words are limited to certain lexical units based on part of speech. The vertices, i.e. the candidate words, are then ranked, and the top T are taken as the extracted keywords.

Nguyen and Kan (2007) [5] proposed keyphrase extraction for scientific articles using morphological features found in such articles.

Wan and Xiao (2008) [6] proposed a single-document keyphrase extraction method using neighborhood knowledge, whereby an expanded document is created by adding neighbor documents to the concerned document, and keywords are then ranked using a graph that utilizes both the local and global information in the neighbor documents.
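The ranking step of a TextRank-style model can be sketched as follows (a simplified, unweighted sketch: it builds a co-occurrence graph over a sliding window and scores vertices by PageRank-style iteration; the window size and damping factor are common defaults, not necessarily the exact settings of [4]):

```python
from collections import defaultdict
from itertools import combinations

def textrank(tokens, window=2, d=0.85, iters=50):
    """Rank words by centrality in their co-occurrence graph."""
    # Connect every pair of words that co-occur within the window.
    graph = defaultdict(set)
    for i in range(len(tokens) - window + 1):
        for a, b in combinations(tokens[i:i + window], 2):
            if a != b:
                graph[a].add(b)
                graph[b].add(a)
    # PageRank-style power iteration over the undirected graph.
    score = {w: 1.0 for w in graph}
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[n] / len(graph[n]) for n in graph[w])
            for w in graph
        }
    return sorted(score, key=score.get, reverse=True)

tokens = ["keyword", "extraction", "keyword", "classifier", "extraction", "keyword"]
ranked = textrank(tokens)  # most central candidate words first
```

The top T entries of `ranked` would then be reported as the extracted keywords.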
Kathait and Tiwari (2017) [7] proposed an unsupervised approach based on the extraction of noun phrases. The candidate words contain only adjectives and nouns. First, individual words are scored using a scoring scheme such as TF-IDF; second, the n-grams or phrases are given scores equal to the sum of the scores of the individual words present in them. The top-ranked phrases are taken as the output keyphrases.

III. PROPOSED WORK

The proposed supervised keyphrase extraction scheme uses the DUC-2001 benchmark dataset, which has around 300 news articles [6]. The methodology consists of four main tasks: (1) Candidate Selection, (2) Feature Extraction, (3) Training the SVM, and (4) Prediction of keyphrases.

Fig. 1 The training and keyword prediction process

A. Candidate Selection

A set of words and phrases is extracted as candidate keyphrases using the regular expression [8]:

{(<Adjective>* <Noun>+ <Preposition>)? <Adjective>* <Noun>+}

Phrases matching this pattern are noun phrases. Noun phrases are selected because nouns contain a tremendous amount of information related to the document. Noun phrases can be better understood by knowing the most frequently occurring patterns in them, i.e. Noun, Adjective Noun, Noun Noun, and Noun Preposition Noun.

B. Feature Extraction

... the end of the document has the maximum weight again.

Parabolic Position: Parabolic position ranks or weighs the candidate words using a similar approach, with the difference that the weight of the candidate words decreases and increases according to the equation of a parabola instead of a line.

iii. Spread of keyphrase in the document: The spread of a word in the document has been found to be a beneficial feature in the closely related task of link generation [11]. A more important keyphrase has a larger standard deviation, which implies that it is widely spread across the text document, so spread proves to be an essential feature. Hence, the spread, calculated as the standard deviation of the positions of the candidate in the document, forms part of the feature list.

iv. Frequency: A more important word tends to occur repeatedly and as a result has a higher frequency in the document [10]. Hence, we consider frequency as a feature in our proposed work.

v. Occurrences of candidates in the documents: According to Louis and Gagnon (2013) [12], position-based features such as first and last occurrence have been found effective in keyword extraction. We used the normalized position of the first occurrence of the candidate, and then extended this approach to use the normalized positions of all occurrences of the candidate in the documents as part of the feature vector.

Values for these features are extracted for the selected candidate words. All these features are then stored as key-value pairs for each candidate word.
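The candidate selection and feature extraction steps above can be sketched end-to-end as follows (a minimal illustration, not the authors' implementation: the POS-tagged input, the simplified tag set, and the toy training labels are hypothetical; in practice a POS tagger supplies the tags and the DUC-2001 gold keyphrases supply the labels):

```python
import re
import statistics
from sklearn.svm import SVC

# Hypothetical POS-tagged document with tags ADJ / NOUN / PREP / OTHER.
tagged = [("automatic", "ADJ"), ("extraction", "NOUN"), ("of", "PREP"),
          ("keywords", "NOUN"), ("helps", "OTHER"), ("search", "NOUN"),
          ("engines", "NOUN"), ("index", "OTHER"), ("keywords", "NOUN")]

# --- 1. Candidate Selection ---
# Encode each tag as one character so the paper's pattern
# {(<Adjective>* <Noun>+ <Preposition>)? <Adjective>* <Noun>+}
# becomes an ordinary regular expression over the tag string.
code = {"ADJ": "A", "NOUN": "N", "PREP": "P"}
tags = "".join(code.get(t, "O") for _, t in tagged)
candidates = set()
for m in re.finditer(r"(?:A*N+P)?A*N+", tags):
    candidates.add(" ".join(w for w, _ in tagged[m.start():m.end()]))

# --- 2. Feature Extraction ---
def features(phrase):
    """Frequency, spread (std. dev. of positions), normalized first occurrence."""
    words = phrase.split()
    n = len(tagged)
    positions = [i for i in range(n - len(words) + 1)
                 if [w for w, _ in tagged[i:i + len(words)]] == words]
    freq = len(positions)
    spread = statistics.pstdev(positions) if len(positions) > 1 else 0.0
    first = positions[0] / n
    return [freq, spread, first]

X = [features(c) for c in sorted(candidates)]

# --- 3. Training SVM ---
# Toy labels (1 = keyword); the paper derives these from gold keyphrases.
y = [1 if "keywords" in c else 0 for c in sorted(candidates)]
clf = SVC(kernel="rbf").fit(X, y)

# --- 4. Prediction of keyphrases ---
predicted = [c for c, label in zip(sorted(candidates), clf.predict(X)) if label == 1]
```

On this toy input the pattern yields the candidates "automatic extraction of keywords", "search engines", and "keywords", matching the Noun, Noun Noun, and Noun Preposition Noun patterns described above.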