
Description of the FietsTas application

Author: Harro Stokman


Date: November 23, 2009.

Management Summary
The author investigates for Talking Trend (TT) what software exists within the
ILPS research group of the University of Amsterdam. In this investigation, nine
applications were identified that are of interest to TT. One of these
applications is FietsTas. This document provides a detailed technical overview
of the application. The information was obtained through an interview with
Valentin Jijkoun from the ILPS group.

The FietsTas application offers a document processing service on the Internet,
allowing organizations to request annotations based on uploaded text files. The
main purpose of the service is to generate term clouds (a graphical display of
e.g. the frequency of terms in the content) and entity lists for text documents. The
application consists of 13,000 lines of Python code, is documented, and is
actively maintained within the ILPS group.

Based on the data described in the detailed findings below, the following answers
can be given to the research questions stated by TT (see footnote 1):

1. What issues exist regarding intellectual property, which need to be solved
before TT applies the software commercially?
The application is closed source. A licensing agreement needs to be
signed with the UvA. The application uses other closed source modules
from the ILPS group: the Compound Splitter, NEN and SSScraper. Finally,
FietsTas uses NER, which in turn uses TnT. The author requested a
commercial license from the owner of TnT, Mr. Thorsten Brants. On
November 23, 2009, Mr. Brants replied that he does not have time to
support commercial licenses.

2. What quality does the software have, what should be done to improve the
quality, such that the software can be used for commercial purposes?
The NER module needs to be replaced before the software can be applied
commercially.
The main attraction of FietsTas is that TT can evaluate the functional
quality of the different modules from the ILPS group without having to
install and combine them.

1 Contract between Stokman and Talking Trends of October 2009.

Detailed findings
This section gives the answers to 51 technical questions:

General
1 What is the name of the application?
FietsTas.

2 Briefly, what does the application do?
The FietsTas application offers a document processing service on the Internet,
allowing organizations to request annotations based on uploaded text files. The
main purpose of the service is to generate term clouds (a graphical display of
e.g. the frequency of terms in the content) and entity lists for text documents.
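
The term frequencies behind such a term cloud can be illustrated with a minimal sketch. This is not the FietsTas implementation: the function name term_frequencies and the naive letter-based tokenizer are assumptions made for illustration only.

```python
import re
from collections import Counter


def term_frequencies(text, top_n=5):
    """Return the top_n most frequent lowercase word tokens as (term, count) pairs."""
    # Naive tokenizer (an assumption): runs of letters, lowercased.
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return Counter(tokens).most_common(top_n)


sample = "The bike bag holds the bike lock and the bike pump."
cloud = term_frequencies(sample, top_n=2)
print(cloud)
```

A real term cloud would normally filter out function words such as "the" and scale the display size of each term by its count.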

3 Is the application language specific (e.g. Dutch "vlakbij, achter Centraal
Station", meaning "nearby, behind Central Station")?
Yes, the language of uploaded documents should be specified. If no specification
is available, the language is detected automatically. Supported languages are
Dutch and English.
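
The report does not say how the automatic detection works. The sketch below shows one common, simple approach (counting stopword hits per language) purely as an illustration. The stopword lists, the language codes "nl" and "en", and the function name detect_language are all assumptions.

```python
# Tiny illustrative stopword lists (assumptions, far from complete).
STOPWORDS = {
    "nl": {"de", "het", "een", "en", "van", "is", "niet", "op", "achter", "bij"},
    "en": {"the", "a", "an", "and", "of", "is", "not", "on", "behind", "at"},
}


def detect_language(text, declared=None):
    """Return the declared language if it is supported, else guess by stopword overlap."""
    if declared in STOPWORDS:
        return declared
    words = text.lower().split()
    # Score each language by how many words of the text are in its stopword list.
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    # On a tie, max() returns the first language in STOPWORDS ("nl" here).
    return max(scores, key=scores.get)


print(detect_language("De fiets staat achter het station"))
print(detect_language("The bike is behind the station"))
```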

4 In what scientific paper is the application used?


The functionality is not described explicitly in academic papers.

5 Who is the owner?
The software is developed in the ILPS group. It is not open source, although it
uses other open source applications (the MySQL database and the Named Entity
Recognizer).

6 Which UvA developers have experience with the application?


Valentin Jijkoun and Andrei Vishiuski.

7 When does their contract with UvA end?


None of the contracts ends within a year from the writing of this report.

8 What alternatives exist for the application (closed or open source)?
An alternative is the OpenCalais application from Reuters. Note that the
OpenCalais application works for English texts only.

9 What is the latest available version?


Unknown, the application is under active development.

Architecture
10 What is the architecture of the application?
A high level overview of the architecture is given in Figure 1. FietsTas is
implemented as a Web service using a simple client-server stateless
protocol: an application sends requests using standard HTTP POST or
GET requests, and the FietsTas service responds by sending a
standardized XML response over HTTP.
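
The request and response cycle described above can be sketched as follows. The endpoint URL, the action and parameter names, and the XML layout of the sample response are hypothetical, since the report does not specify them. The sketch only builds the request objects and parses a canned response, so it runs without network access.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical endpoint; the real service URL is not given in this report.
BASE_URL = "http://fietstas.example.org/api"


def build_request(action, params, method="GET"):
    """Build (but do not send) a stateless HTTP GET or POST request."""
    query = urllib.parse.urlencode(params)
    if method == "GET":
        return urllib.request.Request(f"{BASE_URL}/{action}?{query}", method="GET")
    return urllib.request.Request(
        f"{BASE_URL}/{action}", data=query.encode("utf-8"), method="POST"
    )


def parse_response(xml_text):
    """Parse an XML response; the element and attribute names are assumed."""
    root = ET.fromstring(xml_text)
    terms = [(t.text, int(t.get("count"))) for t in root.iter("term")]
    return {"status": root.get("status"), "terms": terms}


# Invented example of what an XML response might look like.
SAMPLE_RESPONSE = (
    '<response status="ok">'
    '<term count="3">bike</term><term count="1">bag</term>'
    "</response>"
)

request = build_request("cloud", {"doc": "42"})
print(request.get_method(), request.full_url)
print(parse_response(SAMPLE_RESPONSE))
```

Because the protocol is stateless, every request carries all the information the service needs. Sending the request would be a matter of passing it to urllib.request.urlopen.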

Figure 1: High level architecture of FietsTas

Interaction with other systems


11 With which other systems does the application interact?
Internally, the application uses:
• SSScraper for distributed scheduling and for job processing,
• the MySQL database,
• the Named Entity Recognizer,
• the Named Entity Normalizer,
• the Compound Splitter.
In the future, the application is planned to use:
• TimexTag,
• the Stanford tagger (which labels every word as, for example, a noun,
verb or adjective).
Furthermore, the application might use LingPipe in the future for sentiment
analysis.
Externally, the application can interact with third-party software. This is
described in the following section.

12 How does the application interact with other systems?
The internal communication is realized through software wrappers around the
individual modules, which allow data to be sent to and read from each module.
The external communication is realized through APIs in two ways:
• Web API: FietsTas can be accessed via REST and SOAP web services.
These web service layers serve as interfaces to the same functions of
FietsTas, so they can be used interchangeably. For the REST web
service, the user accesses FietsTas via HTTP POST/GET requests and
receives HTTP responses. For the SOAP web service, the user
communicates with FietsTas via SOAP messages.
• Software API: FietsTas API libraries are available for PHP and Python.
Internally, the libraries use the Web API to access FietsTas functions, so
they can be employed in any application that can access FietsTas over the
Internet.
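
A software API library of this kind is essentially a thin wrapper around the Web API. The sketch below illustrates the idea only. It is not the actual FietsTas Python library: the class name FietsTasClient, the method names, the URL layout and the parameter names (key, text, lang, doc) are hypothetical. The transport is injectable so the wrapper can be exercised without a running service.

```python
import urllib.parse
import urllib.request


class FietsTasClient:
    """Hypothetical thin wrapper around a REST-style Web API."""

    def __init__(self, base_url, api_key, transport=None):
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key
        # transport(url, data) returns the response body as text; data is
        # None for GET requests and url-encoded bytes for POST requests.
        self._transport = transport or self._http

    @staticmethod
    def _http(url, data):
        # urlopen issues a POST when data is given and a GET otherwise.
        with urllib.request.urlopen(url, data=data) as resp:
            return resp.read().decode("utf-8")

    def _call(self, action, params, post=False):
        # Every call carries the API key, in line with a stateless protocol.
        query = urllib.parse.urlencode({"key": self.api_key, **params})
        url = f"{self.base_url}/{action}"
        if post:
            return self._transport(url, query.encode("utf-8"))
        return self._transport(f"{url}?{query}", None)

    def upload_document(self, text, language=None):
        params = {"text": text}
        if language:
            params["lang"] = language
        return self._call("upload", params, post=True)

    def get_cloud(self, document_id):
        return self._call("cloud", {"doc": document_id})


# Usage with a fake transport that records calls instead of using the network.
calls = []


def fake_transport(url, data):
    calls.append((url, data))
    return "<response/>"


client = FietsTasClient("http://fietstas.example.org/api/", "SECRET", fake_transport)
client.upload_document("hallo", language="nl")
print(calls[-1])
```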

Hardware, operating environment


13 What operating system does the system run on?
Linux.

14 What operating system is the system developed on?


Linux.

15 Are there any platform limitations that may be reached in the foreseeable
future (e.g. maximum file size, maximum number of concurrent users)?
The MySQL database has a maximum number of connections, although this can
be configured. If scaling up is required beyond the capacity of MySQL,
this may be achieved using other, commercial databases.

16 Is the application dependent on the operating system?


No, in principle the system should not be platform dependent. However, this is
not yet verified in practice.

17 Is the application hardware dependent?


No, in principle the system should not be hardware dependent. However, this is
not yet verified in practice.

Programming languages
18 Which languages are used in the system?
The FietsTas application is developed in the Python programming language. The
internal components are treated as black boxes.

19 Which programming environments are used?


This is dependent on the researcher.

20 How many lines of source code are there?


The application consists of 13,000 lines of Python code and 100 lines of PHP.

Code generation
21 Are parts of the system generated?
No, source code is not generated. Also, FietsTas does not require training data.
22 How are parts of the system generated?
Not applicable (N. A.)

Data storage
23 Which type of storage is used?
A MySQL database.

24 Are vendor specific extensions used?


No.

25 How is the connection with the database and the marshalling of data
organized?
The application contains a special layer containing code to connect to a database
or to flat files. To access the database, standard Python libraries are used for
executing SQL commands.
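
Such a storage layer could look roughly like the sketch below: two interchangeable back ends with the same save/load interface, one backed by SQL and one by flat files. This is an illustration under assumptions, not the FietsTas code. The class names and the table layout are invented, and the standard-library sqlite3 module stands in for MySQL only to keep the example self-contained.

```python
import pathlib
import sqlite3
import tempfile


class DatabaseStore:
    """Stores documents in a SQL table (sqlite3 as a stand-in for MySQL)."""

    def __init__(self, connection):
        self.conn = connection
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS documents (id INTEGER PRIMARY KEY, body TEXT)"
        )

    def save(self, body):
        # sqlite3 uses "?" placeholders; MySQL drivers typically use "%s".
        cur = self.conn.execute("INSERT INTO documents (body) VALUES (?)", (body,))
        self.conn.commit()
        return cur.lastrowid

    def load(self, doc_id):
        row = self.conn.execute(
            "SELECT body FROM documents WHERE id = ?", (doc_id,)
        ).fetchone()
        return row[0] if row else None


class FlatFileStore:
    """Stores each document as a numbered text file in one directory."""

    def __init__(self, directory):
        self.directory = pathlib.Path(directory)
        self.directory.mkdir(parents=True, exist_ok=True)

    def save(self, body):
        # Next id = number of existing files + 1 (fine for a single writer).
        doc_id = len(list(self.directory.glob("*.txt"))) + 1
        (self.directory / f"{doc_id}.txt").write_text(body, encoding="utf-8")
        return doc_id

    def load(self, doc_id):
        path = self.directory / f"{doc_id}.txt"
        return path.read_text(encoding="utf-8") if path.exists() else None


# Both back ends are used through the same two calls.
db_store = DatabaseStore(sqlite3.connect(":memory:"))
db_id = db_store.save("De fiets staat achter het station.")
db_body = db_store.load(db_id)

with tempfile.TemporaryDirectory() as tmp:
    file_store = FlatFileStore(tmp)
    file_id = file_store.save("The bike is behind the station.")
    file_body = file_store.load(file_id)

print(db_body)
print(file_body)
```

Keeping the rest of the application behind a small interface like this is what makes it possible to swap the database for flat files, or for another database product, without touching the calling code.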

User interface
26 Which kind of user interface does the system have (text, web, windows)?
FietsTas provides a web interface for humans, accessible through a web
browser. The web interface allows users to manually upload documents. Only a
subset of the FietsTas functions is made available through this interface: users
can upload/list documents and generate simple annotations/clouds. This
interface lets users test whether the system is applicable to their own
applications.
A simple (password-protected) web interface is also provided for FietsTas
developers. This interface allows developers to visualize internal objects of
FietsTas. It can also be used for debugging.

27 Is there tool support for the user interface?


PHP development environment.

28 How is the connection between the user interface and the rest of the
application organized?
Using the programming language PHP.

Reporting
29 Are reporting facilities available in the application?
There is a monitor available, which is written in Python.

30 Which tools are used for reporting?


N. A.

Performance demands
31 Are performance demands described?
Performance requirements are currently being defined by Mr. Vishiuski. This is
part of the Bridge project.

32 On what size/speed of hardware should the application run?


FietsTas can be parallelized by running several instances on multiple machines.
These servers currently use quad-core processors with 8 GB of memory.
FietsTas also runs on a dedicated server with 4 processors and 20 GB of
memory.

33 What is the current size of the data that can be handled by the
application?
This is not known explicitly. The FietsTas application is used extensively in
projects such as Duoman, TNT (tracking of events in news), and Bridge (a project
together with Beeld en Geluid).

34 Is the application a multi-user system?


Yes. Each user is supplied with an API key, which prevents users from accessing
other users' data.

35 What is the current / max number of concurrent users?


Unknown.

Documentation
Is any of the following documentation available?

36 Architecture description
An overview of the architecture is given in the Duoman document "Data collecting
and indexing infrastructure", STEVIN internal report, November 2008.

37 Functional documentation
Functional documentation is available. The document describes
• how users can obtain keys to use the service,
• the type of documents that can be and should be uploaded,
• the type of request that users can ask FietsTas to perform,
• how a user can provide feedback (this functionality is work in progress).

38 Technical documentation
Developer documentation is available at
http://zookst5.science.uva.nl:8080/FietsTas/. This link is not accessible from
outside, though.

39 How up to date are the above documents?
The Duoman document is from November 2008. The last change to the user
requirements dates from July 2009.

40 Is there a bug reporting system?
Yes, using Trac.

41 Are tools used to support the documentation?


Yes, using Trac.

Configuration management
42 Which version control system is used?
SVN.

43 Are change requests logged?


Yes, using Trac.

Source build process


44 Is the build process automatic?
No, although a Makefile exists for database creation.

45 How typical is the build process (i.e. would new developers know how this
works?)
There is a readme with step-by-step instructions. Installation by Valentin would
take a few days.

Deployment process
46 Is the deployment process automatic?
No.

Testing
47 Does an automated daily test run exist?
No.

48 Are there unit tests?
Yes, 2,500 lines of test code.

49 Are there regression tests?


No.

50 Is any stress testing performed?


No.

51 Are the tests performed automatically?


No.