
2009 International Conference on Computational Intelligence and Security

An Ontology-Based NLP Approach to Semantic Annotation of Annual Report

Baohua Wang, Hejiang Huang, Xiaolong Wang
School of Computer Science and Technology
Harbin Institute of Technology Shenzhen Graduate School
Shenzhen, China
bhwangszu@gmail.com, hjhuang@hitsz.edu.cn, wangxl@insun.hit.edu.cn

Wensheng Chen
College of Mathematics and Computational Science
Shenzhen University
Shenzhen, China
chenws@szu.edu.cn

Abstract - Annual reports of Chinese securities companies have become
the most significant and reliable source of information for domestic
and foreign investors. Semantic annotation of these reports enhances
information retrieval and improves interoperability. In this paper we
first review the major features of annual reports, which are in Tagged
PDF format, and then propose a novel ontology-based NLP approach to
annotate them semantically. The experimental results reported in the
paper indicate good accuracy for our approach.

Keywords - ontology; NLP; semantic annotation; Tagged PDF documents;
annual report

I. INTRODUCTION

The current generation of search engines lacks a semantic-level
understanding of the query or the content, and can only interpret a
document by picking out its most commonly occurring or underlined
words. Semantic search, defined as IR with the capability to understand
the user's intent and the Web's content at a much deeper, conceptual
level, has attracted a lot of attention in the past years [1]. However,
before such semantic-aware services can be implemented, the accumulated
and available content needs to be annotated with proper, high-quality
semantic information. Although unavoidable in some cases, the manual
accumulation of these explicit semantics is not considered a feasible
approach [2]. Our vision is that fully automatic methods for semantic
annotation should be researched and developed.

In this paper we focus on the semantic annotation of annual reports of
Chinese securities companies. They are Tagged PDF documents which
follow the standard required by the CSRC (China Securities Regulatory
Commission). PDF is now a common exchange format between organizations,
and is often the only accessible format [3]. Its main advantage is that
it preserves the visual presentation of a document between screen and
printer, and across different computing platforms, thus making it a
useful format for the exchange of documents on the Web. However, this
is also its main drawback: a PDF file contains little or no explicit
structuring information to help us locate content [4]. Information
extraction from PDF has therefore recently been recognized as a very
important and challenging problem, because PDF encoding is
visualization-oriented [5].

In this paper we present an ontology-based NLP approach to semantic
annotation of annual reports. First, we convert annual reports to XML
using the Save As XML Plug-In in Acrobat, which makes annotation
possible. However, some abnormalities and traps are discovered while
converting PDF documents. To solve this kind of problem, we use pattern
matching to collect all possible identifiers, which are the tokens
followed by the contents that need to be annotated, and thus obtain the
lexicon of identifiers. We then use K-Means clustering to cluster all
identifiers into groups, that is to say, to obtain their property names
in the ontology. Finally, after deleting all spaces between the Chinese
words, we use the Backward Maximum Matching algorithm, an NLP approach
usually used for Chinese word segmentation, to obtain the correct
identifier names, and annotate their following contents based on the
identifier clusters.

This paper is organized as follows. Section II presents the main
features of annual reports. Section III describes the securities
company ontology we constructed. Section IV presents the proposed
approach. Section V summarizes an evaluation of the approach, Section
VI reviews related work, and finally Section VII concludes the paper
and gives some perspectives for future work.

978-0-7695-3931-7/09 $26.00 © 2009 IEEE
DOI 10.1109/CIS.2009.269

II. ANNUAL REPORTS OF CHINESE SECURITIES COMPANIES

Chinese companies are required to release their audited annual reports
no later than April 30 of the following calendar year (available online
at http://www.cninfo.com.cn), and the reports need to follow the
principles listed in the document Guidelines for the Content and Format
of Annual Reports of Securities Companies [6] from the CSRC. The annual
report has become the most significant and reliable source of
information on listed Chinese firms for domestic and foreign investors,
with both annual operating results and some significant events
disclosed for the first time [7]. To allow users to query information
in the annual reports more conveniently and precisely, we need to
annotate the reports first. While the general case is hard to automate
completely, there is visible progress in specific domains such as
scientific publication archives. In the domain of annual reports of
Chinese listed companies, the structure of the reports follows, in the
majority of cases, a number of specific standards. This fact allows
existing information processing techniques to be adopted for processing
and annotation.

Annual reports are mostly Tagged PDF files, which contain an internal
structure tree. Introduced with release 1.4 of the PDF specification
[8], Tagged PDF extends the traditional PDF file to make it more useful
for tasks such as text extraction, reflow, conversion and
accessibility. Users can now easily save contents as XML, HTML, XHTML
or plain text from a tagged Adobe PDF file using the Save As Plug-In in
Acrobat, and place those contents in other tools to reuse the
information [9].
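
As a minimal sketch of how the converted output might be consumed, the snippet below collects the text held by every element of an XML document using Python's standard `xml.etree.ElementTree`. The sample fragment and its tag names (`Document`, `P`) are assumptions made for illustration only; the structure actually produced by the Acrobat plug-in depends on the tags present in each PDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment standing in for a converted report.
# The tag names are illustrative, not taken from real plug-in output.
SAMPLE_XML = """<Document>
  <P>Office Address: No. 268 New Road</P>
  <P>Zip code: 550018</P>
</Document>"""


def extract_text_blocks(xml_text):
    """Return the stripped text of every element that directly holds text."""
    root = ET.fromstring(xml_text)
    blocks = []
    for element in root.iter():          # document-order walk over all elements
        text = (element.text or "").strip()
        if text:                          # skip elements that only contain whitespace
            blocks.append(text)
    return blocks


blocks = extract_text_blocks(SAMPLE_XML)
```

Walking every element rather than looking for specific tags keeps the sketch independent of any particular tagging scheme.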
The phrase shown in Figure 1(a) is taken from the 2007 Annual Report of
CHINA ZHENHUA (GROUP) SCIENCE & TECHNOLOGY CO., LTD (stock code:
000733); it means "Office Address: the northern section of New Road,
No. 268, Wudang District, Guiyang City, Guizhou Province, Zip code:
550018". Annual reports contain content of the kind shown in Figure 1
as well as tabular information. In this study, we mainly discuss
content of the kind shown in Figure 1. As shown in Figure 1(a), we use
the term identifier for the phrase before a colon, which indicates the
semantic meaning of the content that follows it. Because the contents
to be annotated follow a certain pattern, they can easily be annotated
by pattern matching (regular expressions).

However, some abnormalities and traps are discovered while converting
PDF documents. The most common trap is over-segmented words. This
problem depends on PDF writers, which decompose words when typographic
properties change (in particular the kerning, i.e. the offset between
two Chinese words) [5]. The PDF format offers instructions designed to
represent kerning modification in a string, but very often PDF
documents contain separated sub-strings, preserving visualization
instead of content. Figure 1(b) illustrates one such over-segmentation
error: a space is added before the word for "number", and there is no
separator between the word for "number" and the word for "zip code". In
our experience, errors of this kind are not marginal, and wrongly
tokenized text may affect the quality of semantic annotation.

III. SECURITIES COMPANY ONTOLOGY

The securities company ontology represents the concepts and relations
underlying the expert knowledge of securities companies. The purpose of
capturing this meta-knowledge is to provide a baseline for
• analysing and modelling securities company cases;
• developing applications in which different knowledge bases, such as
  a lexicon for natural language parsing and production rules for
  inference, are deployed.
So far we have created the ontology manually, based on the Guidelines
for the Content and Format of Annual Reports of Securities Companies,
because all annual reports should contain the contents listed in this
guideline from the CSRC. The ontology mainly consists of a securities
company class, each instance of which represents one securities
company. The company class has many properties, as required by the
guideline (the property names can be used as the initial group
centroids in our K-Means clustering algorithm). Annotating the content
of an annual report therefore amounts to finding the correct preceding
identifier, which corresponds to a property in the ontology.
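
To make the shape of such an ontology concrete, the following sketch models a single company class whose instances hold annotated property values. It is only an illustration under stated assumptions: the class name `SecuritiesCompany`, the method `annotate`, and the four property names are hypothetical, whereas the real property list comes from the CSRC guideline.

```python
from dataclasses import dataclass, field

# Hypothetical property names; the actual ones are defined by the CSRC guideline.
PROPERTY_NAMES = ["registered_address", "office_address", "zip_code", "stock_short_form"]


@dataclass
class SecuritiesCompany:
    """One instance stands for one securities company."""
    name: str
    properties: dict = field(default_factory=dict)

    def annotate(self, property_name, content):
        """Attach a piece of report content to a property of the ontology."""
        if property_name not in PROPERTY_NAMES:
            raise KeyError(f"unknown ontology property: {property_name}")
        self.properties[property_name] = content


company = SecuritiesCompany(name="Example Securities Co.")
company.annotate("zip_code", "550018")

# The property names double as the initial centroids of the clustering step.
initial_centroids = list(PROPERTY_NAMES)
```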
IV. THE APPROACH

As mentioned previously, we convert the Tagged PDF, which cannot be
annotated directly, into XML using the Save As XML Plug-In in Acrobat.
However, some abnormalities exist, and a property name in the ontology
can be expressed by other words. Our proposed approach consists of
three steps:
A) pattern-based extraction, constructing the identifier lexicon based
   on identifier frequency;
B) clustering the identifiers into groups using word similarity
   calculated against the property names in the ontology, thus
   obtaining the corresponding ontology property name for every
   identifier;
C) Backward Maximum Matching using the constructed lexicon to find the
   right identifier, and annotating its following content according to
   the identifier's cluster.
Each step is explained below.
A. Pattern-Based Extraction Constructing the Lexicon
Based on the Identifier Frequency
Most identifiers are correct and follow the pattern: space, identifier,
colon, content (to be annotated), space (or another delimiter). We
therefore gather all identifiers that match this pattern (a regular
expression). Identifiers whose frequency exceeds a certain threshold
can be regarded as correct; our results show that this is a feasible
method. In our experiments, we found that words with a frequency of
less than ten are always wrong expressions.
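
The extraction and frequency filter can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the regular expression, the function name `build_lexicon`, and the Chinese sample strings (our own renderings of "office address" and "zip code") are assumptions, and the threshold of 2 is chosen only so that the tiny sample produces a result, where the experiments above suggest a value of ten.

```python
import re
from collections import Counter

# An identifier is a run of non-space, non-colon characters that follows
# whitespace (or the start of the text) and is followed by an ASCII or
# full-width colon.
IDENTIFIER_PATTERN = re.compile(r"(?:^|\s)([^\s:：]+)[:：]")


def build_lexicon(documents, threshold):
    """Keep identifiers that occur at least `threshold` times overall."""
    counts = Counter()
    for text in documents:
        counts.update(IDENTIFIER_PATTERN.findall(text))
    return {ident for ident, freq in counts.items() if freq >= threshold}


# Illustrative input: the third document contains an over-segmented identifier.
documents = [
    "办公地址:贵阳市 邮政编码:550018",
    "办公地址:深圳市 邮政编码:518000",
    "办公地 址:北京市",
]
lexicon = build_lexicon(documents, threshold=2)
```

In this sample the broken fragment before the colon in the third document occurs only once, so it falls below the threshold and stays out of the lexicon.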
However, a few words with a frequency greater than the threshold may be
wrong expressions too, so all identifiers are sent to the NLP analyzer
to check whether they are grammatical. If an identifier is parsed as
ungrammatical, it is removed from the lexicon. For example, if an
ill-formed string has a frequency greater than the threshold, the
grammar checker readily shows that it is a wrong expression, and we
remove it from our lexicon.

Figure 1(a). The phrase in the original PDF.
Figure 1(b). The converted phrase in XML.

However, there are also other problems: identifiers with the same
meaning can be expressed by different words, each of which indicates
the semantic meaning of the content that follows it. For example, the
identifier meaning "stock short form" can also be expressed as "company
A share short form" or "company stock short form", which makes it
difficult for us to annotate the contents.


B. Clustering the Identifiers into Groups Using Word Similarity
Calculation Based on the Property Names in the Ontology, Thus
Obtaining the Corresponding Ontology Property Name for Every
Identifier
An annual report should contain all properties in our ontology, as
required by the CSRC, but different identifiers may be used because
each annual report is released by a different company. That is to say,
every identifier should have a corresponding property name with the
same meaning in our ontology. We can therefore cluster the identifiers
into groups using the K-Means clustering algorithm.
The major steps of the clustering are:
1) Place all property names as initial group centroids.
2) Assign each identifier to the group that has the closest centroid.
The distance measure between an identifier and a cluster centre is
chosen as follows: we calculate the word similarity between the
identifier and the centroid from their string distance, which treats
distance as the amount of difference between strings of symbols.
3) When all identifiers have been assigned, recalculate the positions
of the K centroids.
4) Repeat Steps 2 and 3 until the centroids no longer move. This
produces a separation of the objects into groups from which the metric
to be minimized can be calculated.
V. PRELIMINARY EVALUATION

Since we are working with sets of thousands of documents, manual
quality evaluation of the whole collection is not feasible. We selected
all the annual reports from 2007, about 1500 reports in total, for
semantic annotation, and randomly selected 100 reports whose
annotations we checked manually for correctness.
For our evaluation task, we resort to the traditional IR notions of
precision and recall. Precision measures the fraction of the annotated
contents that are correct, and recall measures the fraction of the
required contents that have been correctly annotated. Let AG be the set
of annotated groups and RG be the set of required groups. Precision and
recall are defined as:

$$P = \frac{|AG \cap RG|}{|AG|}, \qquad R = \frac{|AG \cap RG|}{|RG|}$$

where $|AG \cap RG|$ is the number of correctly annotated groups.
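
As a small worked example of these two definitions, the sketch below computes P and R from two sets. The function name `precision_recall` and the sample group labels are hypothetical and serve only to show the arithmetic.

```python
def precision_recall(annotated, required):
    """Return (P, R) for a set of annotated groups AG and required groups RG."""
    annotated, required = set(annotated), set(required)
    correct = annotated & required                      # AG ∩ RG
    precision = len(correct) / len(annotated) if annotated else 0.0
    recall = len(correct) / len(required) if required else 0.0
    return precision, recall


# Four groups were annotated, three of them among the five required groups.
AG = {"registered_address", "office_address", "zip_code", "spurious_group"}
RG = {"registered_address", "office_address", "zip_code", "stock_code", "stock_short_form"}
P, R = precision_recall(AG, RG)   # P = 3/4, R = 3/5
```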
In Table I we report the results of our evaluation tests. These
preliminary results show that our procedure is capable of achieving a
quality level comparable with that of the ideal set, without using
external manually verified datasets and without human supervision.
TABLE I.

C. Backward Maximum Matching Using the Constructed Lexicon to Find
the Right Identifier and Annotate Its Following Content According to
the Identifier's Cluster
Our hypothesis is that words which occur in ill-formed elements occur
correctly in other annual reports, so correctly tokenized words can be
used to retokenize the bad ones. Because the main problems are too many
spaces or missing spaces around the correct identifier, we first delete
all spaces between Chinese words. Since this also deletes the
delimiters, we then use the Backward Maximum Matching (BMM) algorithm,
which searches backward from every colon and matches the longest
possible identifier by referring to the lexicon we have constructed.
From the longest identifier found, we know its property name through
its cluster, and can easily annotate its following contents. For
example, a Chinese sentence meaning "Registered Address: the northern
section of New Road, No. 268, Wudang District, Guiyang City, Guizhou
Province; Office Address: the northern section of New Road, No. 268,
Wudang District, Guiyang City, Guizhou Province; Zip code: 550018"
becomes a single unbroken string once all spaces are deleted. Using
BMM, we find that the identifiers should be the words for "Registered
Address", "Office Address" and "Zip code", because these words are in
our lexicon, and they have the same meaning as the corresponding
property names in the ontology, so we can easily annotate their
following content.
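
The backward search from each colon can be sketched as follows. This is an illustration under assumptions rather than the authors' code: the function name `bmm_annotate`, the rule that a content span ends where the next matched identifier begins, the property names, and the Chinese sample text (an invented address, not the one in the report) are all ours.

```python
import re


def bmm_annotate(text, lexicon, identifier_to_property):
    """Annotate contents by backward maximum matching from every colon."""
    compact = re.sub(r"\s+", "", text)              # delete all spaces first
    max_len = max(map(len, lexicon), default=0)
    hits = []                                       # (identifier start, colon index, identifier)
    for colon, ch in enumerate(compact):
        if ch not in ":：":
            continue
        # Try the longest candidate ending at the colon, then shorter ones.
        for length in range(min(max_len, colon), 0, -1):
            candidate = compact[colon - length:colon]
            if candidate in lexicon:
                hits.append((colon - length, colon, candidate))
                break
    annotations = []
    for k, (start, colon, ident) in enumerate(hits):
        # Assumption: content runs up to the start of the next identifier.
        end = hits[k + 1][0] if k + 1 < len(hits) else len(compact)
        content = compact[colon + 1:end].strip(";；,，")
        annotations.append((identifier_to_property[ident], content))
    return annotations


# Illustrative lexicon and cluster mapping (hypothetical property names).
identifier_to_property = {
    "注册地址": "registered_address",
    "办公地址": "office_address",
    "邮政编码": "zip_code",
    "地址": "address",
}
lexicon = set(identifier_to_property)

# Over-segmented input: a stray space inside one identifier, none before another.
text = "注册地址: 贵阳市新添大道 268 号 办公地 址:贵阳市新添大道268号邮政编码:550018"
annotations = bmm_annotate(text, lexicon, identifier_to_property)
```

Because the longest candidate is tried first, the four-character identifiers win over the shorter lexicon entry that is a suffix of two of them.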

TABLE I. PRELIMINARY EVALUATION RESULTS

Precision    Recall
80.1%        79.3%

VI. RELATED WORK

A large body of work on approaches and systems for semantic annotation
of semi-structured and unstructured documents is currently available in
the literature. The two Semantic Web frameworks Annotea and CREAM favor
different aspects of annotation activity [10]. Annotea, with its
emphasis on collaboration, has influenced the development of a number
of excellent systems with good user interfaces that are well suited to
distributed knowledge sharing. CREAM, with its greater emphasis on the
deep Web and the annotation of legacy resources, has pushed the
development of annotation systems aimed more towards corporate KM.
Existing approaches and systems, as described in [10], can be
classified as manual, semi-automatic and automatic according to the
degree of automation adopted. The most significant manual annotation
tools are Amaya [11], Mangrove [12] and Vannotea [13]; they have a
great deal in common with purely textual annotation tools but provide
some support for ontologies. Some manual annotation tools, such as
OntoMat, S-CREAM [14] and SHOE Knowledge Annotator [15], have been
developed to provide more sophisticated user support and a degree of
semi-automatic or automatic annotation. Automation can generally be
regarded as falling into three categories. The most basic kind uses
rules or wrappers written by hand that try to capture known patterns
for the annotations. Systems that belong to this family are Lixto [16]
and MnM [17]. Then there are two kinds of systems that learn how to
annotate. Supervised systems such as Melita [18] and CAFETIERE [19]
learn from sample annotations marked up by the user. A problem with
these methods is that picking enough good examples is a non-trivial and
error-prone task. To tackle this problem, unsupervised systems such as
KIM [20] employ a variety of strategies to learn how to annotate
without user supervision, but their accuracy is still limited.

PDF-oriented IE approaches appeared recently in the literature after
the seminal work of Gottlob et al. [21]. In particular, Flesca et al.
[22] propose a fuzzy logic approach that exploits spatial PDF features
to recognize relevant information; Baumgartner et al. [23] use document
understanding techniques to identify atomic elements of PDF documents,
on which they apply spatial reasoning and wrapping techniques to
identify significant document blocks. The ontology research area has
also influenced IE: a lot of work on the use of ontologies for
extracting meaningful information from HTML Web documents has been
proposed in the literature. To the best of our knowledge, no existing
approach deals with the problem of semantic annotation of PDF documents
by using an ontology.
VII. CONCLUSIONS

The main contribution of this paper is a novel approach to semantic
annotation of annual reports. The annual reports are in Tagged PDF
format, which is hard to annotate directly, so we convert them to XML;
this, however, introduces new problems. We constructed an ontology and
used NLP methods to overcome these problems. The experimental results
reported in Table I show a precision of 80.1% and a recall of 79.3% for
our approach. Our method does not rely on any user supervision, and can
be the foundation of semantic search in the future. The preliminary
evaluation encourages us to undertake new research topics. In
particular, our future work will deal with semantic annotation of the
tabular information in annual reports. The final goal of our work is a
semantic search platform for the domain of annual reports, which will
facilitate investors' retrieval of information about Chinese securities
companies.

ACKNOWLEDGMENT

This work is supported by the High Technology Research and Development
Program of China (2006AA01Z197), the Major Program of National Natural
Science Foundation of China (No. 60435020), and the National Natural
Science Foundation of China (No. 60873168, No. 60603028).

REFERENCES
[1] Peter Mika (2008) Semantic Search Arrives at the Web [Online].
    Available: http://www.devx.com/semantic/Article/38595
[2] Kiryakov, A., Popov, B., Terziev, I., Manov, D., Ognyanoff, D.:
    Semantic Annotation, Indexing, and Retrieval. Journal of Web
    Semantics 2, Issue 1, Elsevier (2004) 47-49
[3] Dejean, H., Meunier, J.L.: A system for converting PDF documents
    into structured XML format. In: Proceedings of the Workshop on
    Document Analysis Systems, Nelson, NZ (2006)
[4] Tamir Hassan, Robert Baumgartner: Using graph matching techniques
    to wrap data from PDF documents. In: Proceedings of the 15th
    International Conference on World Wide Web, 2006, 901-902
[5] Oro, E., Ruffolo, M.: XONTO: An Ontology-Based System for Semantic
    Information Extraction from PDF Documents, in Conf. ICTAI '08,
    2008, Volume 1: 118-125
[6] (2008) The CSRC website. [Online]. Available:
    http://www.csrc.gov.cn/n575458/n575667/n4231514/n4231533/n4696680/9310925.html
[7] Haw, I.M., D. Qi & W. Wu. Timeliness of Annual Report Releases and
    Market Reaction to Earnings Announcements in an Emerging Capital
    Market: The Case of China, Journal of International Financial
    Management and Accounting, vol. 11(2), pp. 108-131, Dec 2002.
[8] Adobe Systems Incorporated, PDF Reference (Second Edition) version
    1.3, ISBN 0-201-61588-6, Addison-Wesley, July 2000.
[9] Save As XML Plug-In for Windows (beta 2). [Online]. Available:
    http://www.adobe.com/support/downloads/detail.jsp?hexID=89a2
[10] Uren, V., Cimiano, P., Iria, J., Handschuh, S., Vargas-Vera, M.,
    Motta, E., Ciravegna, F.: Semantic annotation for knowledge
    management: Requirements and a survey of the state of the art.
    Journal of Web Semantics: Science, Services and Agents on the World
    Wide Web 4, 14-28 (2006)
[11] V. Quint, I. Vatton, An Introduction to Amaya, W3C NOTE
    20-February-1997, 1997
    (http://www.w3.org/TR/NOTE-amaya970220.html accessed on 28 July
    2004).
[12] L. McDowell, O. Etzioni, S. Gribble, A. Halevy, H. Levy, W.
    Pentney, D. Verma, S. Vlasseva, Enticing ordinary people onto the
    Semantic Web via instant gratification, in: Proceedings of the 2nd
    International Semantic Web Conference (ISWC 2003), October 2003.
[13] R. Schroeter, J. Hunter, D. Kosovic, Vannotea, A collaborative
    video indexing, annotation and discussion system for broadband
    networks, in: Proceedings of the K-CAP 2003 Workshop on Knowledge
    Markup and Semantic Annotation, October 2003, Florida, 2003.
[14] S. Handschuh, S. Staab, R. Studer, Leveraging metadata creation
    for the Semantic Web with CREAM, KI 2003 - advances in artificial
    intelligence, in: Proceedings of the Annual German Conference on
    AI, September, 2003.
[15] J. Heflin, J. Hendler, A portrait of the Semantic Web in action,
    IEEE Intell. Syst. 16 (2) (2001) 54-59
[16] R. Baumgartner, R. Flesca, G. Gottlob, Visual web information
    extraction with Lixto, in: Proceedings of the International
    Conference on Very Large Data Bases (VLDB), 2001.
[17] M. Vargas-Vera, E. Motta, J. Domingue, M. Lanzoni, A. Stutt, F.
    Ciravegna, MnM: A tool for automatic support on semantic markup,
    KMi Technical Report, September 2003, TR Number 133, 2003.
[18] F. Ciravegna, A. Dingli, D. Petrelli, Y. Wilks, User-system
    cooperation in document annotation based on information, in:
    Proceedings of the 13th International Conference on Knowledge
    Engineering and KM (EKAW02), 1-4 October 2002, Siguenza, Spain,
    2002.
[19] W.J. Black, J. McNaught, A. Vasilakopoulos, K. Zervanou, B.
    Theodoulidis, F. Rinaldi, CAFETIERE conceptual annotations for
    facts, events, terms, individual entities, and relations,
    Parmenides Technical Report, 11 Jan, 2005, TR-U4.3.1, 2005.
[20] B. Popov, A. Kiryakov, D. Ognyanoff, D. Manov, A. Kirilov, KIM - a
    semantic platform for information extraction and retrieval, Nat.
    Lang. Eng. 10 (3/4) (2004) 375-392
[21] G. Gottlob, C. Koch, R. Baumgartner, M. Herzog, and S. Flesca. The
    Lixto data extraction project - back and forth between theory and
    practice. In PODS, pages 1-12, 2004.
[22] S. Flesca, S. Garruzzo, E. Masciari, and A. Tagarelli. Wrapping
    PDF documents exploiting uncertain knowledge. In CAiSE, pages
    175-189, 2006.
[23] T. Hassan and R. Baumgartner. Intelligent text extraction from PDF
    documents. In CIMCA/IAWTIC, pages 2-6, 2005.
