If you use the keyword extraction software or the dynamic link library (DLL) in your program or research, please cite the following paper for the relevant part of your work:

Zhen YANG, Jianjun LEI, Kefeng FAN, Yingxu LAI. Keyword Extraction by Entropy Difference between the Intrinsic and Extrinsic Mode. Physica A: Statistical Mechanics and its Applications, 392 (2013), 4523-4531. http://dx.doi.org/10.1016/j.physa.2013.05.052
Introduction
As the title suggests, we propose a text keyword extraction method and provide it both as a DLL and as a software tool for anyone to use. Comments and suggestions on the DLL and the software tool are welcome.
For Chinese text, format the input as follows. First, remove the punctuation and non-text elements (such as charts) from the text. Then segment the sentences into a list of words, with two successive words separated by a space. We provide a function to remove the punctuation and non-text elements, but you should make sure that the sentences in the text have been segmented into a list of words. Finally, select one of the two methods, general entropy or maximum entropy, to extract the keywords.
With these steps done, you can easily complete your keyword extraction work!
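The formatting steps above can be sketched as follows for English text. This is a minimal illustration of our own, not the DLL's built-in pre-processing function; Chinese text additionally needs word segmentation first:

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Lowercase the text, replace punctuation with spaces, then split on whitespace.
std::vector<std::string> preprocess(const std::string& text) {
    std::string cleaned;
    cleaned.reserve(text.size());
    for (unsigned char c : text) {
        if (std::isalnum(c))
            cleaned.push_back(static_cast<char>(std::tolower(c)));
        else
            cleaned.push_back(' ');  // punctuation and symbols become separators
    }
    std::istringstream in(cleaned);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}
```

The output is the space-separated word list that the extraction functions expect.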
Background
One of the most significant differences between human-written texts and monkeys
typing is the general existence of meaningful topics in human written texts.
Keyword/relevant word extraction and ranking are the starting point for critical tasks
like topic detection and tracking in written texts, and they are widely applied in
information extraction, selection and retrieval.
Here is a brief introduction to the principle of the algorithm; it will help you understand and use the DLL and the software.
The idea of the intrinsic and extrinsic modes is based on the general observation that highly significant words tend to be modulated by the writer's intention, while common words are spread essentially uniformly throughout a text. The intrinsic mode thus denotes the statistical properties of the appearance of a relevant word within a topic, i.e., the statistical properties of clustering within each topic. The extrinsic mode captures the statistical properties of the disappearance of word clustering along a written text, and it characterizes the relationship between word-clustering occurrence within a topic and an author's writing style. As shown in FIG. 2, the distance between two successive occurrences of a word is defined as d_i = t_{i+1} − t_i, where t_i is the position of the i-th occurrence of the word in the text. The arrival-time difference d_i belongs to the intrinsic mode if d_i < μ, where μ is the word's mean waiting time; in other words, a given occurrence of the word is part of an intrinsic mode if its local separation is less than its mean waiting time. Let d_I = {d_i | d_i < μ} be the set of all d_i < μ, as shown in the bottom-left panel of FIG. 2.
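To make these definitions concrete, here is a small sketch that computes the successive-occurrence distances d_i = t_{i+1} − t_i for one word and splits them into the intrinsic set (d_i < μ) and the extrinsic set. The struct and function names are ours for illustration, not part of the DLL:

```cpp
#include <numeric>
#include <vector>

// Intrinsic/extrinsic partition of one word's occurrence distances.
struct Modes {
    std::vector<int> intrinsic;  // distances d_i < mu
    std::vector<int> extrinsic;  // distances d_i >= mu
    double mu = 0.0;             // mean waiting time of the word
};

// Given the positions t_i of one word's occurrences in the text,
// compute d_i = t_{i+1} - t_i and partition around the mean mu.
Modes split_modes(const std::vector<int>& positions) {
    std::vector<int> d;
    for (std::size_t i = 0; i + 1 < positions.size(); ++i)
        d.push_back(positions[i + 1] - positions[i]);
    Modes m;
    if (!d.empty())
        m.mu = std::accumulate(d.begin(), d.end(), 0.0) / d.size();
    for (int di : d)
        (di < m.mu ? m.intrinsic : m.extrinsic).push_back(di);
    return m;
}
```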
Through experiments we found that keywords in an article cluster into aggregates, so their intrinsic-mode entropy is large while their extrinsic-mode entropy is small. General words, by contrast, are distributed evenly through the article, so the spacing between any two consecutive occurrences varies little, and the entropy difference between the intrinsic and extrinsic modes is small. You can therefore use the value E, the entropy difference between the intrinsic and extrinsic modes, to extract keywords. In practice, to eliminate randomly distributed words and handle boundary effects, we use the Cc boundary condition and the normalized entropy difference Enor as the final indicators. For the full details of the algorithm, please see the full paper (http://dx.doi.org/10.1016/j.physa.2013.05.052).
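As a rough illustration of the ranking idea only (the exact estimator, binning, and the Cc and Enor normalizations are defined in the paper), the following sketch computes a Shannon entropy over a set of distances and takes the intrinsic/extrinsic difference E:

```cpp
#include <cmath>
#include <map>
#include <vector>

// Shannon entropy (in bits) of the empirical distribution of distance values.
double shannon_entropy(const std::vector<int>& d) {
    if (d.empty()) return 0.0;
    std::map<int, int> counts;
    for (int v : d) ++counts[v];
    double h = 0.0;
    for (const auto& kv : counts) {
        double p = static_cast<double>(kv.second) / d.size();
        h -= p * std::log2(p);
    }
    return h;
}

// Entropy difference E between the intrinsic- and extrinsic-mode distances.
double entropy_difference(const std::vector<int>& intrinsic,
                          const std::vector<int>& extrinsic) {
    return shannon_entropy(intrinsic) - shannon_entropy(extrinsic);
}
```

Words are then ranked by this difference (after normalization, by Enor), largest first.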
Usage
Now we give a detailed description of the keyword extraction software and of the use of the DLL. Two samples will be used throughout the examples to illustrate the performance of the Enor metric: a scientific book in English and a news report in Chinese.
Please note: before evaluating a text, all punctuation symbols are removed, all words are changed to lowercase, and then a simple whitespace-based tokenization is applied. For Chinese text, Chinese word segmentation must additionally be done first.
The keyword extraction software also provides a text pre-processing function. After pre-processing, the text is in the standardized format: lowercase words separated by single spaces.
Two versions of the C++ DLL are provided, a release version and a debug version; please select the one matching your build. For example, use the DLL in the "release" folder if you compile your code with the solution configuration set to "Release". We recommend the release version because it is faster than the debug version.
Step1:
In the unzipped folder, find the dll directory (in the C++ folder) and open the release folder. It contains three files (the .dll, the .lib, and "Node.h"). Copy these three files into your project and include "Node.h" in your project directory.
Step2:
Please note: the order of variables in the structure in "Node.h" must not be changed!
#pragma comment(lib, "Keyword_Extraction.lib")

/* input:  string text - the text after pre-processing
           int &num    - returns the size of the Node array
   return: Node*       - the keyword array */

// return the keyword array using the general entropy method
extern Node* keyword_extra_entropy(string text, int &num);

// return the keyword array using the maximum entropy method
extern Node* keyword_extra_entropy_MAX(string text, int &num);
The DLL exports two functions: Node* keyword_extra_entropy(string text, int &num) and Node* keyword_extra_entropy_MAX(string text, int &num). The first uses the general entropy method; the second uses the maximum entropy method. Both take two inputs: a string (the pre-processed text) and an int reference (which returns the size of the array). The output is a Node* array, whose Node structure contains the fields introduced above.
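For orientation only, a Node of the following shape is consistent with the sample code in Step 3. The field names are inferred from that code; the authoritative declaration, including the mandatory field order, is the shipped "Node.h":

```cpp
#include <string>

// Illustrative sketch only: the real layout is fixed by the shipped "Node.h",
// and its field order must not be changed.
struct Node {
    std::string word;  // the extracted keyword
    double EDnor;      // normalized entropy difference Enor
    int frequency;     // occurrence count of the word in the text
};
```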
Step3:
After following the steps above, you can call the function to get the keywords. For example, the following code prints the top 10 keywords:
int num;
Node *result = keyword_extra_entropy_MAX(text, num);
for (int i = 0; i < 10 && i < num; ++i)
    cout << endl << result[i].word << "===" << result[i].EDnor << "===" << result[i].frequency;
Example:
Now we select the book "Origin of Species" as an example and demonstrate the whole process of using the DLL (the input file name below is illustrative):
code:
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include "Node.h"
using namespace std;

#pragma comment(lib, "Keyword_Extraction.lib")
extern Node* keyword_extra_entropy_MAX(string text, int &num);

int main() {
    // read the whole pre-processed text into one string (example file name)
    ifstream in("origin_of_species.txt");
    string text((istreambuf_iterator<char>(in)), istreambuf_iterator<char>());
    int num;
    Node *result = keyword_extra_entropy_MAX(text, num);
    // output all keywords in the array; here "num" is the size of the array
    for (int i = 0; i < num; ++i)
        cout << endl << result[i].word << "===" << result[i].EDnor << "===" << result[i].frequency;
    system("pause");
}
The C# version of the DLL packages the entire class, so it contains more functions than the C++ version (including pre-processing functions, etc.). Please refer to the file "KEBOED interface documentation" to learn the usage of the C# DLL.
For the Chinese sample, we chose a news report from the web titled 《让雷锋精神代代相传》 ("Let the Spirit of Lei Feng Be Passed Down Through the Generations"). We used the keyword extraction software with the "maximum entropy" extraction method to produce the keyword list.
Both sample texts are included in the compressed package.
Conclusion
In summary, understanding the complexity of human-written text requires an appropriate analysis of the statistical distribution of the words in texts. We find that highly significant words tend to be modulated by the writer's intention, while common words are spread essentially uniformly through a text. The ideas of this work can be applied to any natural language with clearly identified words, without requiring any prior knowledge of semantics or syntax.
License
This article has no explicit license attached to it but may contain usage terms in the
article text or the download files themselves. If in doubt please contact the author via
the discussion board below.