
Abstract

With the advent of powerful search engines that can provide results semantically relevant to the user's
queries, finding and reusing any piece of information has become extremely easy. This has become a major
issue, since students, writers, and even academics use chunks of information in their works without
providing proper citation of the original source. This poses a major risk, since plagiarism is considered a
copyright infringement and hence a fraud.

This work presents research on building a plagiarism detection system. Multiple methods are
presented, along with their advantages, disadvantages, and the technologies used. The emphasis is put on
the plagiarism detection system as a whole, rather than on the algorithm itself.

This research aims to show how a plagiarism detection system can be built by combining well-known
technologies to determine whether a document is plagiarized or unique. The system relies on a database of
normalized documents against which comparisons are made. This software could be used in the education
system to detect whether papers are unique, encouraging new ideas and discouraging content copying and
idea theft.

Rezumat (Summary)
With the development of high-performance search engines, which return results semantically relevant to
the user's queries, finding and using information has become extremely simple. This has become a major
problem, because students, writers, and even academics use chunks of information in their works without
adding proper citation of the original source. This raises a major issue, because plagiarism is considered a
copyright infringement and is therefore considered a fraud.

This work intends to present research on creating a plagiarism detection system. Several methods are
presented, with their advantages and disadvantages and the technologies used. The emphasis is placed on
building the plagiarism detection system, not on researching the main algorithm.

This research aims to show how a plagiarism detection system can be developed using well-known
technologies and how to combine them in order to detect whether a document is considered plagiarized or
unique. The system relies on a database of normalized documents against which the comparisons are
made. This software could be used in the education system to detect whether students' papers are unique,
to encourage new ideas, and at the same time to discourage the plagiarizing of others' work.

Introduction
Over the past few years, information resources have more than tripled in size thanks to the advent of
the World Wide Web. It is now extremely easy to find a variety of information on different subjects, from
different authors, and even the same information from different sources. While this has its positive
aspects (such as information freedom), there are also negative ones. Much of the publicly available
information was meant to serve as personal learning material, not to be reused in one's own work,
where a claim of uniqueness is made.
“Plagiarism is the practice of taking someone else’s work or ideas and passing them off as one’s own.” -
New Oxford American Dictionary
As mentioned in an article from the University of Southern Queensland [1], plagiarism can have
different causes. Some of them are misusing resources without providing proper citation; copying and
paraphrasing blocks of text; or even abusing the limits allowed for group work. Generally speaking,
plagiarism as a term does not apply to textual documents only. Plagiarism can also be found in the
software industry, when code is used without proper source identification and credit to the
original author, and in the graphic industry, where icons, colors, or text used in a graphic product can be
considered plagiarized if they closely resemble an earlier work. This research will focus on text
plagiarism only.
Peter Joy and Kevin McMunigal [2] have reflected on the problem of plagiarism in the academic
environment. According to them, a major grading criterion for a work is its originality, and when someone
submits for grading a paper built on ideas belonging to other authors, that can be considered a "cardinal
sin". A paper needs to represent the personal findings and ideas of its author, since these represent the
individuality of a person. Copying someone else's idea and claiming it as your own is like trying to
impersonate someone, which is unethical and unlawful.
There are several ways to combat the plagiarism phenomenon: to prevent it, or to detect it and take
the appropriate measures. Preventing plagiarism has more to do with measures taken by teachers in
academic institutions, who should make sure that students are aware that plagiarism is a bad habit and
that they can be punished for it. However, preventing something from happening is very difficult, and it is
impossible to make sure that everyone conforms to the rules. While the process cannot be controlled, the
end product easily can be. For that reason, plagiarism detection is an effective method to make sure that a
work is unique and reflects the person who created it.

1 Domain analysis
With the rapid growth of the World Wide Web, more and more information has become freely and
publicly available for any knowledge seeker. Unfortunately, this has brought the plagiarism phenomenon
to pose even higher risks in different environments. One of the most affected environments is that of
academic institutions.
Education is a double-sided process: on one side, the student does research and learns from it, and on the
other side, the teacher grades this knowledge. When a student obtains ready-made information for an
assignment and uses it without diving into its essentials, they do not learn anything new and do not
develop critical thinking. On the other side, the teacher's assessment turns out to be
erroneous and does not reflect the real level of education. This can have serious implications in
the long term, since generations come and go, and upcoming students are taught by those who used to
practice the bad habit of plagiarizing content. This defeats the purpose of humanity – constant evolution.
According to a study of American secondary schools, when the different types of plagiarism are summed
up, the figure reaches 90% [3]. There is also a more recent study and an attempt to solve the plagiarism
problem in the Republic of Moldova [4]. According to a preliminary test with a plagiarism detection tool,
there were cases where research papers matched with a 100% similarity percentage, while others raised an
alarm with 40-50% similarity. As of this time, the software seems not to have been adopted by the
academic institution in question (ASEM).
Unfortunately, the number of research papers that contain plagiarized content grows considerably every
year. Teachers cannot cope with detecting plagiarized content manually, since the volume of papers is
large and the sources of information are constantly growing; it is therefore impossible for a human to keep
track of all papers and spot the ones that are plagiarized. There is a clear need for an
automatic detection system that can both grow its database of documents and easily detect the
similarity between them.
There is a series of available tools for plagiarism detection. Among them, one distinct
service, called Turnitin [5], is the leader in plagiarism detection services. The mechanisms behind this
service are kept secret, and they barely offer any insight into how the system works behind the scenes. One
thing to note about this system is that they store student works in their database so that they can be used
for future testing. According to their statistics, more than 337 million student works have been added to
their database, mostly by academic institutions, which clearly serves as a reminder that the
education system needs a plagiarism detection tool to cope with duplicated content among student works.
Below, some well-known plagiarism detection systems are compared by features:

Table 1.1: Existing solutions comparison
System name                       | Searches the Internet? | Searches in local database? | Performs text normalization? | Provides detailed information?
Advego Plagiatus                  | Yes                    | No                          | No                           | Yes
Istio                             | No                     | Yes                         | No                           | Partially
Miratools                         | Yes                    | No                          | Yes                          | Yes
Plagiat-inform                    | Yes                    | Yes                         | No                           | Yes
Praide Unique Content Analyzer II | Yes                    | No                          | No                           | Yes
Proposed solution                 | No                     | Yes                         | Yes                          | Yes

By examining table 1.1, we can notice that most plagiarism detection systems rely on searching the
Internet for duplicates and content similarity. Most of the time they use common search engines to
perform these searches (Google, Bing, Yahoo! Search). The main drawback of this solution is that
search engines are meant to search for keywords, not chunks of full text. Moreover, most of them
protect against multiple similar requests by either temporarily blocking access to their services or
enforcing a captcha test. Therefore, such systems also need to implement sophisticated proxy connections,
which turn out to be unreliable and not scalable.

2 System architecture
2.1 System vision
2.1.1 Problem statement
It is easy to see that the rate at which plagiarism cases arise is growing at an alarming
pace. According to an article from the Open Education Database (December 19, 2010), an informal poll
from 2007 revealed that 60.8% of polled college students admitted to cheating, and 16.5% did not regret
it. The disturbing news is that, according to the article, cheaters got higher marks than those
who study honestly. This is a bad sign, since it can have a discouraging effect on the students who
do their work honestly, and grades are kept artificially high by people who did not deserve them.
Moreover, it is hardly possible to manually detect whether a work is plagiarized, due to the high number
of students and works submitted. The need for an automatic plagiarism detection system is apparent; such
a system can easily search across massive amounts of information to tell whether a document is plagiarized.
The beneficial effect of a successful plagiarism detection tool is that it will hopefully discourage students
from practicing this bad habit and encourage creating unique, quality content.
2.1.2 System objectives
• Implement an algorithm that can identify whether a document is considered plagiarized, based on
a similarity percentage. The "plagiarized" status will be given to a document based on
experimental similarity percentage findings.
• Have the ability to check a document against a database of existing documents.
• Show similarities between two documents, to better understand what words, phrases, and
paragraphs have been plagiarized.
• Have a good performance index. The system needs to perform well with thousands of documents
stored in the database.
The system will make it possible to determine whether a document is duplicated or not. This decision will
be made by checking the document content against a series of locally stored documents, using a
fast-performing text similarity algorithm. If the similarity index between two documents is high, this
means that one of them might contain plagiarized content. The person using the system will then be able
to see what portions of the document are found in the other document. The technique used for this is
called text-diff, and it shows where different words have been used in place of the original ones in order
to trick the plagiarism detection software.
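As an illustration of this technique (a minimal sketch only, using Python's standard difflib module rather
than the system's actual implementation), a word-level diff of two phrasings can be produced as follows;
words prefixed with "-" or "+" are the ones that were swapped, which is exactly the kind of synonym
substitution the text-diff view is meant to expose.

import difflib

original = "plagiarism is considered a copyright infringement"
suspect = "plagiarism is regarded as a copyright violation"

# ndiff compares the two word sequences: '-' marks words only in the
# original, '+' marks words only in the suspect, ' ' marks common words.
for token in difflib.ndiff(original.split(), suspect.split()):
    print(token)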
2.1.3 Identifying the stakeholders
The stakeholders represent all parties that can affect the development of this system. By analyzing the
people who are going to use this system, it can be concluded that there are three parties. The first
of them is the users, who in our case could be teachers, academic institutions, students, or whoever
needs to check a specific document for plagiarism. This party will use the system actively, and they are
the ones who can provide additional feedback for continuous system improvement. The second party is
the authors, the people whose works are checked for plagiarism. These people do not use the system
directly (unless they need to check their works for uniqueness for personal reasons), but they provide a
substantial contribution to the whole plagiarism detection system. Their contribution lies
in their works that are added to the local database; the bigger this database gets, the better the results the
system can provide and the easier it is to spot plagiarized documents. It should be noted that the
efficiency of the system described in this paper relies heavily on its arsenal of stored information.
Finally, the third identified party is the developers who maintain the system. It is their task to
continuously research and improve the algorithm used for plagiarism detection, so that the stored
information can be analyzed in the best way.
2.1.4 Documenting the functional requirements of the system
The functional requirements of a system describe the functionality and components that need to be
present in the application described. By analyzing the system objectives, several requirements can be
identified:
• Upload document
• Convert the document format to a single format
• Normalize the input text
• Store document in local database
• Check for similarities against other locally stored documents
• Get a score based on the similarity test
• See a text difference with similar documents
The first requirement of this system is to allow the person who uses it to upload a document into the
system. Once the document has been uploaded, it needs to be converted to a specific format (text/html in
our case). Without this step, it would be impossible (or hardly possible) to perform further operations on
the text. Further down the road, the system needs to take care of the content, specifically the text. Several
normalization techniques are applied to clean out unnecessary characters (such as non-Latin characters),
images, tables, or other content that is hard to parse and analyze; also, stop words are removed (stop
words are words defined as too general to take into consideration for uniqueness, such as function words
"at", "this", "the", "not", etc.). Once the text has been carefully normalized, we need to store it in the
database, so that we can avoid performing the aforementioned steps again when we need to check for
similarity with other documents. Another requirement is the similarity test; this test makes use of the text
analysis algorithm to find out how similar two documents are. Based on this similarity test, we get a
similarity score (or index) expressed as a percentage. If the score is 100%, the two documents are
duplicates; the lower the score, the more unique and original the two documents are with respect to each
other. The final requirement in our list is the text difference feature, which allows the person who is
performing the test to see the actual textual similarity between the two documents. This is presented in a
text-diff manner that shows in green the text that is present in the other document, and in red/purple the
words/phrases that differ between the two documents. Based on this text-difference test, a person
can see whether someone tried to replace words with their synonyms to trick the system into believing
that their work is unique and original.
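The normalization step described above can be sketched as follows. This is an illustrative sketch only:
the stop-word list and function names are hypothetical and are not the system's actual code; the md5 hash
of the normalized text is the mechanism mentioned later for skipping exact duplicates.

import hashlib
import re

# Illustrative stop-word list; the real system would use a much larger one.
STOP_WORDS = {"at", "this", "the", "not", "a", "of", "is"}

def normalize_text(raw_text):
    # Lowercase the text and keep only Latin letters, digits and whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", raw_text.lower())
    # Drop the stop words.
    return " ".join(w for w in text.split() if w not in STOP_WORDS)

def content_hash(normalized_text):
    # Hash of the normalized text, used to detect exact duplicates quickly.
    return hashlib.md5(normalized_text.encode("utf-8")).hexdigest()

normalized = normalize_text("The game of life is a game of everlasting learning.")
print(normalized)            # game life game everlasting learning
print(content_hash(normalized))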
2.1.5 Documenting the non-functional requirement of the system
Non-functional requirements do not describe specific functionality, but rather important
characteristics of the system that can have a positive or negative impact on its quality. By analyzing
today's standards, several requirements have been identified:
• Performance
• Usability
• Reliability
• Modifiability
• Maintainability
• Accuracy and precision
From the proposed list of non-functional requirements, two are of higher importance for this type of
system: the performance and accuracy requirements. It is very important that the system
provide fast results, considering that the number of documents added to the database is constantly
growing and the time needed to perform a plagiarism check grows linearly with it. Usually, algorithms that
provide the best results perform worse, so a balance between performance and reliability needs to be
found. The other important requirement is accuracy. It is mandatory that the system be accurate
in its results, since the number of documents is high, and receiving a large set of irrelevant results can lead
to frustration and abandonment of the system. The system also needs to be maintainable and modifiable,
since development on it is continuous by nature and should therefore feel effortless.
2.2 Architectural representation
2.2.1 Use Case Diagrams
To represent the system components, the interaction between them, and the parties that will interact with
the system's functionality, UML diagrams will be used for simplicity and convenience. Figure 2.1 shows the
system's use case diagram, which represents the system's features and functionality. A user can
perform several actions with the plagiarism detection system, such as uploading a document to be stored
in the database. Before the document is stored in the database, it is converted to a
predefined format and its content is normalized. Next, the user can choose to check
whether a document is plagiarized; they can select one of the defined algorithms: normalized
compression distance or the term frequency - inverse document frequency algorithm (a sketch of the
former is given below). They can also see the text difference of two stored documents, shown in a manner
similar to how Git works, highlighting the content that is common to both documents or showing where
terms have been replaced with other terms (presumably synonyms).
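The normalized compression distance mentioned above can be summarized by a short sketch. This is an
illustration only, based on the standard NCD formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
with zlib as the compressor C; it is not the system's production code.

import zlib

def ncd(x, y):
    # Compressed sizes of each document and of their concatenation.
    cx = len(zlib.compress(x.encode("utf-8")))
    cy = len(zlib.compress(y.encode("utf-8")))
    cxy = len(zlib.compress((x + y).encode("utf-8")))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Values close to 0 indicate very similar documents, values close to 1
# indicate unrelated documents.
print(ncd("the game of life is a game of everlasting learning",
          "the game of life is a game of everlasting learning"))
print(ncd("the game of life is a game of everlasting learning",
          "never stop learning"))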

Figure 2.1: System's Use Case Diagram


The following diagram (figure 2.2) represents the actors that interact with the system and the actions they
can perform.

Figure 2.2: Actors interacting with the system and actions they can perform
The end user can be any of: simple users, teachers, students, or academic institutions that need to
check a document for uniqueness. The end user can upload a new document into the system's database
(if an exact copy is already present, the system will report that the uploaded document is a duplicate), find
other documents that are similar in content to their document, and see the text difference of two similar
documents.
2.2.2 State Diagram
State diagrams (also called State Chart diagrams) are used to better understand any complex or unusual
functionality or business flows in specialized areas of the system. State diagrams depict the dynamic
behavior of the entire system, of a sub-system, or even of a single object in a system. After examining the
proposed system and the states of the users' actions in the system, the diagram in figure 2.3 has been
elaborated. Before performing any action on the system, the user needs to authenticate themselves, that
is, log into the system. Then the user can choose several ways to interact with the system:
they can upload a new document to the database, delete their existing documents, or see the text
difference between two documents. After they have finished interacting with the system, they can choose
to log out of the system.

Figure 2.3: System's state diagram
2.2.3 Chosen architectural patterns
Architectural patterns are a way of solving recurring architectural problems. They promote code
reuse and follow conventions that encourage the habit of writing good code. As described by R. N. Taylor
[4]:
“An Architectural Pattern is a named collection of architectural design decisions that are applicable
to a recurring design problem parameterized to account for different software development contexts
in which that problem appears.”
The benefit of using architectural patterns is that they provide a common language for developers. This
allows more time to be spent on extending the application with features rather than on the specifics of how
some functionality should best be built. Every architectural pattern makes it possible to achieve a specific
global system property, such as the adaptability of the user interface. The most commonly used
architectural patterns are described below.
• The Client/Server pattern segregates the system into two applications, where the client makes
requests to the server. In many cases, the server is made of a database with application logic
represented as stored procedures.
• The Component-Based Architecture pattern decomposes application design into reusable functional
or logical components that expose well-defined communication interfaces.
• The Domain Driven Design architectural pattern is an object-oriented architectural style focused on
modeling a business domain and defining business objects based on entities within the business domain.
• The Layered architecture pattern partitions the concerns of the application into stacked groups (layers).
• The Message Bus pattern is an architectural style that prescribes the use of a software system that can
receive and send messages using one or more communication channels, so that applications can interact
without needing to know specific details about each other.
• The N-Tier / 3-Tier architecture segregates functionality into separate segments in much the same way
as the layered style, but with each segment being a tier located on a physically separate computer.
• The Object-Oriented pattern is a design paradigm based on the division of responsibilities for an
application or system into individual reusable and self-sufficient objects, each containing the data and the
behavior relevant to the object.
• The Service-Oriented Architecture (SOA) refers to applications that expose and consume functionality
as a service using contracts and messages.
Most of the time, the architecture of an application is a combination of several architectural patterns that
make up the complete system.
Analyzing the proposed system and choosing from the most suitable architectural patterns, the
multi-layered architectural pattern has been chosen. This architectural pattern uses several layers to
separate the different responsibilities inside the software system. The framework upon which this system
is built has the MVC pattern at its core, which adheres to the layered architectural pattern.
MVC stands for Model-View-Controller. MVC separates domain / application / business logic from the
rest of the user interface. It does this by separating the application into three parts: the model, the view,
and the controller. The model manages the fundamental behaviors and data of the application. It can
respond to requests for information, respond to instructions to change the state of its information, and
even notify observers in event-driven systems when information changes. This could be a database, or any
number of data structures or storage systems, but it is not limited to storage systems. The view
effectively provides the user interface element of the application; it renders the data from the model
into a form that is suitable for the user interface. The controller performs the logic part of the application:
it receives user input and makes calls to model objects and the view to perform the appropriate actions.
The architectural principles guide the design of a system (business system, information system,
infrastructural system). They set the rules for design decisions where business criteria can be translated
into language and specifications. The principles require developing a framework that includes appropriate
policies and procedures to support their implementation. Architectural principles respected by this
application are:
• High cohesion;
• Loose coupling;
• Separation of concerns;
• Information hiding;
• Liskov Substitution;
• Inversion of control;
• Interface segregation;
• Modularity;
• Design for change;
• Convention over configuration.
Well-defined responsibility boundaries for each layer, and the assurance that each layer contains
functionality directly related to the tasks of that layer, help to maximize cohesion within the layer.
Communication between the layers of the system is based on abstraction, and this provides loose coupling
between the layers. Each layer has predefined tasks and concerns. The Separated Presentation pattern
divides UI processing concerns into distinct roles. Since the layers have specific responsibilities, no
assumptions need to be made about data types, methods and properties, or implementation during design,
as these features are not exposed at layer boundaries, thus assuring information hiding. Liskov
substitution is assured because the separation between the functionalities of each layer is clear. Lower
layers have no dependencies on higher layers, which makes the replacement of one layer quite possible.
Because each layer exposes only a single interface to the needed functionality, the interface segregation
principle is also fulfilled by the system's architecture. The modularity of the system is achieved because
each layer in the system is a module and the layers interact with each other through well-defined
interfaces. Since everything is divided into modules/layers, it is easier to add new functionality (without
affecting the behavior of the entire application), therefore ensuring that the system is designed for change.
The convention over configuration principle is guaranteed by using naming conventions between the data
layer and the logic layer.
2.2.4 Architectural sketch
In figure 2.4 we have represented the architectural pattern mentioned earlier as the one used in our
system. On the top level is the presentational layer, which is responsible for presenting information to the
user. The presentational layer represents a user interface that the user can use to interact with our
application. It is responsible for templating, result caching, and returning the correct result format based on
the request made. This is also how our system interacts back with the user (for example, returning
results or asking for more user input). This layer does not contain any business logic, but only presents
the data sent from the lower layer, which is the business layer. The business layer coordinates the
application and routes the user requests to controllers, which in turn process the request and return the
result to the presentational layer. The business layer lies between the presentational layer and the data
access layer, and interacts more with the latter. The last layer in our list is the data access layer, where
information is stored in and fetched from the storage medium (a database for this system). The only layer
that has direct, non-abstracted access to the information is the data access layer. It is also here that
data is cached to speed up similar queries over short periods of time.

Figure 2.4: The architectural sketch of the system


2.3 System design
2.3.1 Class Diagram
The Class Diagram provides an overview of the target system by describing the objects and classes
inside the system and the relationships between them. It has a broad variety of usages, from
modeling the domain-specific data structure to the detailed design of the target system. With the shared
model facilities, it is possible to reuse the class model in the interaction diagram for modeling the detailed
design of the dynamic behavior. The form diagram allows diagrams to be generated automatically with a
user-defined scope. A UML class diagram is similar to a family tree. A class diagram consists of a group
of classes and interfaces reflecting important entities of the business domain of the system being modeled,
and the relationships between these classes and interfaces. The classes and interfaces in the diagram
represent the members of a family tree, and the relationships between the classes are analogous to
relationships between members of a family tree. Interestingly, classes in a class diagram are
interconnected in a hierarchical fashion, like a set of parent classes (the grand patriarch or matriarch of the
family, as the case may be) and related child classes under the parent classes.

In figure 2.5 we have represented the class diagram of the proposed application with three classes: the
Document, User, and Eloquent classes. The Document class represents the "document" object, which has
a variety of properties (id, normalized_text, created_at, updated_at, hash) and methods (find, fill, boot,
all, findorfail, delete, etc.). The "id" property represents the unique identifier of the document. When the
document is stored in the database, a new unique ID is assigned to it. We use the ID all the time, from
fetching the content, details, and timestamps of the document, to finding similar documents and seeing
the text difference of two documents. The "normalized_text" property represents the normalized content
of the document, after it has been filtered of unnecessary words, markup language tags, tables, and
unnecessary symbols. This is basically the most important field for a document in our system, because it
contains the information we use to find out whether a document is plagiarized or not. The "created_at"
and "updated_at" fields are timestamps, and hold the creation time and the time when changes are made
to the document's state. The "hash" field, which is of type string, represents the md5 hash of the
"normalized_text" field, and is used to avoid storing a duplicate document twice. The hash field permits
checking much faster whether the same document is already stored in the database, since it is also an
index and is much shorter than the "normalized_text" field. The methods of the Document class are
actually inherited from the base class "Eloquent" and will be described later.

The User class represents the "User" object and holds the information about the actor that actually
interacts with our system. The "id" field is used for the same purpose as described earlier. The "name"
field represents the name of the user, and is only used to visually identify the currently logged-in user in
the system. The "role" field describes the role of the user in the system. It can be a simple user, who can
only use publicly open system functionality and has limited actions that can modify the data. On the other
hand, there is the administrator role, which has broader privileges and can add new users, delete users,
manage documents of any user, and perform chore tasks such as cache pruning, database optimization,
and rebuilding the normalized_text field (in case we improve the normalization algorithms). The
"Eloquent" class is a base class that serves as a starting ground for most of the classes in our system. It
contains the common properties and methods used to interact with the data layer. Some of the important
methods of this class are "find", which is used to fetch an object from the database based on its ID; the
"delete" method, used to delete an object from the database; and "where", which is used to find objects
based on a conditional query. Also represented in the Class Diagram is the relationship between the
"User" class and the "Document" class. The relationship is one to many (1..*), that is, one "User" to many
"Documents". This means that a user can have many documents that belong to them, but no document can
belong to more than one user.

Figure 2.5: The Class Diagram

2.3.2 Component Diagram


A component is a modular unit that is replaceable within its environment. Its internals are hidden, but
it has one or more well-defined provided interfaces through which its functions can be accessed. A
component can also have required interfaces. A required interface defines what functions or services it
requires from other components. By connecting the provided and required interfaces of several
components, a larger component can be elaborated. A complete software system can be taken as a
component. Component diagrams are used to model the physical aspects of a system. Figure 2.6 shows
how the components of the proposed system interact with each other. From the diagram we
can observe that the system is built of the users who interact with our application, the database that stores
the information, and the server that is used to serve the requests from the user to the application and vice
versa, from the application to the user. The user interacts with the system through the application
interface. The application interface serves as a bridge between the data component and the server
component. The data component is responsible for managing the system's information (storing, fetching,
processing, etc.). The server component is used to process the requests so that the application can map
those requests to the corresponding controllers.

Figure 2.6: The Component Diagram

2.3.3 Deployment Diagram


The Deployment Diagram is a structure diagram which shows the architecture of the system as a
deployment (distribution) of software artifacts to deployment targets. Artifacts represent concrete
elements in the physical world that are the result of a development process. Examples of artifacts are
executable files, libraries, archives, database schemas, configuration files, etc. A deployment target is
usually represented by a node, which is either a hardware device or some software execution environment.
Nodes can be connected through communication paths to create networked systems of arbitrary
complexity. Figure 2.7 represents the deployment diagram of the proposed system. The Web Server is the
execution environment for our system; it contains the application and processes and executes the
incoming requests. The user's browser represents the user client, and it is used to make requests to our
application via TCP/IP using the Hypertext Transfer Protocol on port 80. The request router component
then processes this request and calls the appropriate controller to execute the request and return a result.
The application server also makes use of the database server node, which holds all the information and in
this case is represented by a MySQL 5 database. The communication between the application server and
the database node is made over TCP/IP.

Figure 2.7: The Deployment Diagram

3 Technologies used
In order to develop a product or service, it is necessary to find the right tools and technologies that
will ease development, cut costs, and speed up development time. There is a wide variety of
technologies to choose from for the given system, and in the end it comes down to one's own preferences,
experience, and goals.
3.1 Chosen technologies
There are many factors to take into consideration when deciding what tools and
technologies to use when developing a product or service. A stand-alone implementation is beneficial
when the end product needs to perform a single task, with few dependencies on other tools and services.
Text processing applications are a good example of stand-alone implementation, because such
applications start quickly on one's computer and are always there, installed, ready to start and process
commands. Media players are also a good example of stand-alone implementation, since Internet
connectivity might sometimes be missing, so online playback is not possible. Another software
implementation is the client-server one. Such an implementation is used when the software solution
implies a server that runs the main application and a client computer that is used to issue the requests. The
web-based option is when the main application is installed on a remote server that contains all the
computational logic, and specially built client applications or just Internet browsers are used to make use
of the application installed on the remote server. This is a perfect solution these days, since everything
tends to move to the cloud, and the user only needs to have a powerful enough device to send
requests, process results, and visualize these results on the screen. Good examples of web-based
applications are Facebook, SoundCloud, or, for example, the package tracking software used by DHL.
Web-based solutions are convenient because all data is stored on the company's servers, and the
algorithms and source code of the applications are not exposed at all (not even in binary format, as in
the case of stand-alone applications). Web-based solutions are also more convenient from the economic
point of view, because it is easier to monetize such software and control licensing.

The next point to be considered is the construction of the system. There are four construction options
that can be applied to existing systems. The standard product represents solutions that are available as
off-the-shelf products. The product encapsulates a set of built-in features, much like a pre-fabricated
building. If the features available in the product are a good fit for the requirements of the business
processes and policies, a standard product may be a good choice. Customized construction is when
several standard products have facilities to configure and customize features, requiring varying amounts
of expertise. The amount of customization required to achieve the required facilities is the prime
consideration in choosing this option. Custom-built solutions have a high degree of flexibility and can be
built to suit the exact requirements of the business. However, the costs and risks can be higher. New
development methodologies like Agile Development can minimize these costs and risks. The last type of
construction is the open source solution. Open source applications have emerged as a new option for
constructing IT solutions. Built by teams of programmers who collaborate by volunteering their efforts
without compensation and for mutual benefit, several open source applications have grown to become
feature-rich products with complete source code available to any user. Enterprises can use open source
applications with low upfront investments and, with relevant technical skills, can customize the code to
match their requirements. Open source alternatives for basic requirements like email access, web site
management, basic collaboration, etc. are now easily found, but applications for business process
management are few and not yet mature enough. However, this is a space to watch.

One factor that has to be taken into account when choosing the technology is the budget of the software
and its type of license. Another important factor is internationalization support. Considering the factors
mentioned and the purpose of the system, the technologies have been chosen accordingly.
3.1.1 MySQL 5
MySQL 5 has been chosen as the database management system. There are many database
management systems (DBMS) on the market which provide the same capabilities (to some extent). In the
process of choosing the desired management system, a comparison was performed. As shown in table
3.1.1, the DBMS chosen for examination were: Microsoft SQL Server, Oracle Database, PostgreSQL,
MySQL and IBM's DB2. A few terms have to be explained. ACID (Atomicity, Consistency, Isolation,
Durability) is a set of properties that guarantee that database transactions are processed reliably; all of
the DBMS support it. Referential integrity is a property which, when satisfied, means that every value of
one column of a table needs to exist as a value of another column in a different (or the same) table. For
referential integrity to hold in a relational database, any field in a table that is declared a foreign key can
contain either a null value, or only values from a parent table's primary key or a candidate key. In other
words, when a foreign key value is used, it must reference a valid, existing primary key in the parent table.
It is also supported by all the DBMS, with one exception: MySQL supports it only partially. There are
differences at the OS support level: MSSQL naturally runs on Windows, while PostgreSQL is available
on Windows, Mac, Linux, UNIX, BSD and Android. MSSQL and DB2 have a limited database size,
while PostgreSQL, MySQL and Oracle are unlimited. When considering the database capabilities, it can
be noticed that MySQL has wide support for different operating systems and does not lack many of the
other DBMSs' capabilities. The same applies to the indexes: MySQL also supports them without any
limitation, while other systems apply constraints or lack some of the indexes specified. Considering other
types of objects, data types, foreign keys and associated operations (i.e. cascade delete), MySQL supports
a wider range than the other systems.

Table 3.1.1: Database management system comparison

Criterion: MSSQL | Oracle | PostgreSQL | MySQL | DB2

OS: Win | Win, Mac, Linux, Unix, z/OS | Win, Mac, Linux, Unix, BSD, Android | Win, Mac, Linux, Unix, BSD, z/OS, AmigaOS, Symbian | Win, Mac, Linux, Unix, z/OS, iOS

ACID: Yes | Yes | Yes | Yes | Yes

Referential integrity: Yes | Yes | Yes | Partial | Yes

Transactions: Yes | Yes (except DDL) | Yes | Yes (except for DDL) | Yes

Unicode: Yes | Yes | Yes | Yes | Yes

Max DB size: 524,272 TB | Unlimited | Unlimited | Unlimited | 512 TiB

Indexes: Hash (Non/Cluster & fill factor), Expression, Partial, Fulltext, Spatial | R-/R+ tree, Hash (Cluster tables), Expression, Partial, Reverse, Bitmap, Full-text, Spatial | R-/R+ tree, Hash, Expression, Partial, Reverse, Bitmap, GiST, GIN, Full-text, Spatial (PostGIS) | R-/R+ tree (MyISAM tables only), Hash (MEMORY, Cluster, InnoDB tables only), Full-text | Expression, Reverse, Bitmap, Full-text

Capabilities: Union, Intersect and Except (2005 and beyond), Inner joins, Outer joins, Inner selects, Merge joins, Blobs and Clobs, Common Table Expression | Union, Intersect, Except (via MINUS), Inner joins, Outer joins, Inner selects, Merge joins, Blobs and Clobs, Common Table Expression, Windows Functions, Parallel Query | Union, Intersect, Except, Inner joins, Outer joins, Inner selects, Merge joins, Blobs and Clobs, Common Table Expression | Union, Inner joins, Outer joins, Inner selects, Blobs and Clobs | Union, Intersect, Except, Inner joins, Outer joins, Inner selects, Merge joins, Blobs and Clobs, Common Table Expression, Windows Functions, Parallel Query

Other objects: Data Domain, Cursor, Trigger, Function, Procedure, External routine | Data Domain, Cursor, Trigger, Function, Procedure, External routine | Data Domain, Cursor, Trigger, Function, Procedure, External routine | Cursor, Trigger, Function, Procedure, External routine | Data Domain (via Check Constraint), Cursor, Trigger, Function, Procedure, External routine

Data types: – | Subset of SQL'92 types plus specific types; some SQL'92 types are mapped into Oracle types; no boolean type nor equivalent | Native data types, including boolean, money, date-time and numeric types; SQL'92 data type syntax maps directly onto native Postgres types | Broad subset of SQL'92, including SQL'92 numeric types | –

3.1.2 Laravel 4 Framework


Laravel is a PHP framework built around the latest architectural patterns and code conventions.
Because it has a minimum requirement of PHP 5.3, it brings in all the goodies of PHP 5.3 and up, such as
namespaces, quick array creation, traits, closures, and others. Laravel focuses on making development
fun, yet it has a strong foundation and code conventions that keep the end product's quality resistant over
time and easy to maintain. Laravel has everything needed to build a modern web application. It has
support for active-record-style database interaction through its Eloquent base model class. In listing 3.1
we can see how easy it is to manipulate an object in Laravel. The Task object in this case is actually
backed by objects stored in the database. Thanks to the Active Record model style implemented in
Laravel, there is no need to write a single line of raw database queries, since the Eloquent model maps
everything to the fields in the database and abstracts the methods of storing and retrieving the
information.
// Fetch all tasks
$tasks = Task::all();

// Fetch the task with an id of 1
$task = Task::find(1);

// Update a task
$task = Task::find(1);
$task->title = 'Put that cookie down!';
$task->save();

// Create a new task
Task::create([
    'title' => 'Write article'
]);

// Delete a task
Task::find(1)->delete();

Listing 3.1 – Working with Eloquent objects


Another nice feature of Laravel is its easy route mapping functionality. It is possible to process requests
inside closures, create resourceful controllers, or use RESTful controllers to map requests to their
respective methods in the controller. The good thing about Laravel's routing implementation is that it
allows filters to be specified, which help to protect routes against unauthorized access. It is also possible
to map URL parameters to parameters used in the functions that process the request. Listing 3.2 shows
how easy it is to build a route for processing a user's profile page.
Route::get('users/{id}', function($id) {
    // find the user
    $user = User::find($id);

    // display view, and pass user object
    return View::make('users.profile')
        ->with('user', $user);
});

Listing 3.2 – Request routing in Laravel


3.1.3 Composer
Composer is a tool for dependency management in PHP. It allows the dependent libraries a
project needs to be declared, and it will install them in the project automatically. Composer is extremely
convenient to use when a project depends on many third-party libraries, because it reduces the time
needed to check for library updates, perform the updates for individual libraries, and keep track of any
library's own dependencies. Composer does all this with a single command:
php composer.phar update

Composer is built with PHP, and thanks to the huge PHP community it has given life to a lot of publicly
accessible packages on websites like packagist.org and others. The good thing about this tool is that it
encourages code reuse and the development of modular applications. Thanks to this, a module can be
re-used in other applications, thus cutting development time, cost, and maintenance. Composer's
dependencies are defined in a composer.json file that contains a JSON structure. An example is
provided in listing 3.3, which shows how a configuration file can be built to require two packages inside a
project. Composer also builds a special autoloading file that can be used inside an application to lazy
load classes only when there is a need to instantiate an object of that class type.
{
    "require": {
        "illuminate/foundation": ">=1.0.0",
        "mockery/mockery": "dev-master"
    },
    "autoload": {
        "classmap": [
            "app/controllers",
            "app/models",
            "app/database/migrations",
            "app/tests/TestCase.php"
        ]
    }
}

Listing 3.3 – Composer's configuration file


3.1.4 Apache 2 Server
Apache 2 is generally recognized as the world's most popular Web server (HTTP server). Originally
designed for Unix environments, the Apache Web server has been ported to Windows and other network
operating systems. The name "Apache" derives from the word "patchy", which the Apache developers used
to describe early versions of their software. The Apache Web server provides a full range of Web server
features, including CGI, SSL, and virtual domains. Apache also supports plug-in modules for
extensibility. Apache is free software, distributed by the Apache Software Foundation, which promotes
various free and open source advanced Web technologies. Apache is very easy to configure and its
functionality is easy to extend. Listing 3.4 shows how a new virtual host can be set up in an Apache .conf
file.
# Ensure that Apache listens on port 80
Listen 80
# Listen for virtual host requests on all IP addresses
NameVirtualHost *:80

<VirtualHost *:80>
    DocumentRoot /www/example1
    ServerName www.example.com

    # Other directives here
</VirtualHost>

<VirtualHost *:80>
    DocumentRoot /www/example2
    ServerName www.example.org

    # Other directives here
</VirtualHost>

Listing 3.4 – Apache configuration file, configuration of VirtualHosts


Apache also supports the .htaccess file, which can be used to configure how requests should be treated.
Thanks to this configuration file, it is possible to create pretty URLs: links that look good in the browser,
but which actually map to a dynamic script that receives all the parameters. It is also possible to enable
modules such as gzip output compression, cache expiration management, DoS attack prevention, etc.
<IfModule mod_rewrite.c>
<IfModule mod_negotiation.c>
Options -MultiViews
</IfModule>
RewriteEngine On
# Redirect Trailing Slashes...
RewriteRule ^(.*)/$ /$1 [L,R=301]
# Handle Front Controller...
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{REQUEST_FILENAME} !-f
RewriteRule ^ index.php [L]
</IfModule>

Listing 3.5 – Apache 2 folder level .htaccess configuration file

3.1.5 Git
Git is a free and open source distributed version control system designed to handle everything from
small to very large projects with speed and efficiency [7]. The stand-out feature of Git is that it does not
need a centralized server the way Subversion does. Git also supports branching, which is extremely
helpful when working on several parts of an application at once, while the work is not yet ready to make
it into the app itself. Git is also supported by a wide variety of IDEs and by tools such as Composer,
which makes it much easier for developers to build modular packages. Below are some features of Git
that stand at the foundation of the software:
• Frictionless Context Switching. Create a branch to try out an idea, commit a few times, switch
back to where you branched from, apply a patch, switch back to where you are experimenting, and
merge it in.
• Role-Based Codelines. Have a branch that always contains only what goes to production, another
that you merge work into for testing, and several smaller ones for day to day work.
• Feature Based Workflow. Create new branches for each new feature you're working on so you can
seamlessly switch back and forth between them, then delete each branch when that feature gets
merged into your main line.
• Disposable Experimentation. Create a branch to experiment in, realize it's not going to work, and
just delete it, abandoning the work with nobody else ever seeing it (even if you've pushed other
branches in the meantime).
The repository for this project is stored locally and on BitBucket as a private repository. Committing and
pushing the changes is done through PhpStorm's internal Git tools.
3.1.6 LibreOffice
LibreOffice is a comprehensive, professional-quality productivity suite that anyone can download
and install for free. There is a large base of satisfied LibreOffice users worldwide, and it is available in
more than 30 languages and for all major operating systems, including Microsoft Windows, Mac OS X
and GNU/Linux (Debian, Ubuntu, Fedora, Mandriva, Suse, ...). You can download, install and distribute
LibreOffice freely, with no fear of copyright infringement. What's outstanding about LibreOffice?
LibreOffice is a feature-packed and mature desktop productivity package with some really great
advantages:
• It's free – no worry about license costs or annual fees.
• No language barriers – it's available in a large number of languages, with more being added
continually.
• LGPL public license – you can use it, customize it, hack it and copy it with free user support and
developer support from our active worldwide community and our large and experienced developer
team.
• LibreOffice is a free software community-driven project: development is open to new talent and
new ideas, and our software is tested and used daily by a large and devoted user community; you,
too, can get involved and influence its future development.
In the proposed system, LibreOffice plays an important role, since the system makes use of the software's
capability of converting documents to HTML format. This is achieved by using LibreOffice's command-line
executable, which allows documents to be converted from many popular formats (such as doc, docx, rtf,
odt) to HTML; a sketch of such an invocation is shown below.
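As an illustration, the conversion can be driven from code by calling the soffice executable in headless
mode. This is a sketch only: the file paths are hypothetical, and the exact behaviour of the flags can vary
between LibreOffice versions.

import subprocess

# Convert an uploaded document to HTML using LibreOffice in headless mode.
subprocess.run(
    ["soffice", "--headless", "--convert-to", "html",
     "--outdir", "/tmp/converted", "/tmp/uploads/paper.docx"],
    check=True,
)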
3.1.7 Pysimsearch library
This plagiarism detection system relies on a Python library called Pysimsearch [8] to calculate the
similarity between two full texts using the Tf-idf algorithm. Tf-idf stands for term frequency-inverse
document frequency, and is a quick algorithm for detecting similarity based on overlapping term
frequencies from one document to the other.
3.1.8 SASS
SASS stands for Syntactically Awesome Stylesheets, and is a tool to improve development with the
CSS styling language. The advantages that SASS brings to CSS are numerous, and some of the most
important are:
• Variables: it is possible to define variables in SASS, which means that everything related to the
configuration can be kept in variables in a separate file. It is also possible to make calculations
with the variables, such as dividing a width into equal parts.
• Mixins: these allow reusable blocks of styling to be defined with @mixin and included in an
element's rules with @include. This is very useful, for example, when an element needs rounded
corners: we just include a rounded mixin.
• Functions: SASS comes with many useful functions, especially for working with colors. Functions
like lighten, darken, invert, and saturate help to manipulate colors easily and cleverly.
• Inheritance: probably the most valuable feature missing from standard CSS. Inheritance makes
writing CSS code much easier, and helps keep the code clean and structured.
SASS was used to style the user interface of the proposed application.
3.1.9 Text normalization
Normalization is a process that involves transforming characters and sequences of characters into a
formally-defined underlying representation. This process is most important when text needs to be
compared for sorting and searching, but it is also used when storing text to ensure that the text is stored in
a consistent representation.
Decomposition:
• Compatibility Decomposition: maps the character to one or more other characters that can be used as
a replacement for the first character. This process is not necessarily reversible.
For example: [ ˚ :U+02DA RING ABOVE] = [ :U+0020 SPACE] + [ ̊ :U+030A COMBINING
RING ABOVE]
• Canonical Decomposition: maps the character to its canonical equivalent character or characters. A
canonical mapping is reversible without losing any information.
For example: [ Å :U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE] = [ A :U+0041
LATIN CAPITAL LETTER A] + [ ̊ :U+030A COMBINING RING ABOVE]
Composition:
• Canonical Composition: as opposed to canonical decomposition, it replaces sequences of a base
character plus combining characters with pre-composed characters.
For example: [ A :U+0041 LATIN CAPITAL LETTER A] + [ ̊ :U+030A COMBINING RING
ABOVE] = [ Å :U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE]
Unicode Normalization:
Unicode normalization defines four forms of normalized text. The D, C, KD, and KC normalization forms
differ both in whether they are the result of an initial canonical or compatibility decomposition, and in
whether the decomposed text is recomposed with canonically composed characters wherever possible.
• D: Canonical decomposition, not followed by canonical composition
• C: Canonical decomposition, followed by canonical composition
• KD: Compatibility decomposition, not followed by canonical composition
• KC: Compatibility decomposition, followed by canonical composition
Example:
• Input: [aﬀairé] (spelled with the "ﬀ" ligature, U+FB00)
• D: a + ﬀ + a + i + r + e + ́ (combining acute)
• C: a + ﬀ + a + i + r + é
• KD: a + f + f + a + i + r + e + ́ (combining acute)
• KC: a + f + f + a + i + r + é
From the above, we can use the normalization algorithm to do the following (a short code sketch follows
the list):
• Strip diacritics:
  • Normalization D
  • Base characters (non-diacritics) always come first
  • Diacritics are characters in the Combining Diacritical Marks block
  • The results, base and diacritics, are not necessarily ASCII
• Split ligatures:
  • Normalization KC
  • Split into multiple characters
  • Some split characters contain a space and should be trimmed
  • The results are not necessarily ASCII
• Strip diacritics & split ligatures:
  • Normalization KD
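A minimal sketch of these steps, using Python's standard unicodedata module (illustrative only; the
system's actual normalization code may differ):

import unicodedata

def strip_diacritics_and_ligatures(text):
    # NFKD: compatibility decomposition splits ligatures (e.g. "ﬀ" -> "f" + "f")
    # and separates base characters from combining diacritical marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks (category "Mn"), keeping the base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_diacritics_and_ligatures("aﬀairé"))  # -> affaire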
3.1.10 Tf-idf algorithm
Tf-idf stands for term frequency-inverse document frequency, and is an excellent algorithm for
checking whether two documents are similar. Since this is the most important algorithm, upon which the
accuracy of the application depends, more details will be provided about how this algorithm works and
what it is used for. In 1998, Google handled an average of 9,800 search queries every day. In 2012 this
number shot up to an average of 5.13 billion searches per day. The graph given below shows this
astronomical growth. The major reason for Google's success is their PageRank algorithm. PageRank
determines how trustworthy and reputable a given website is. But there is also another part: the input
query entered by the user should be used to match the relevant documents and score them. The analogy
made to Google's algorithm is that the Tf-idf algorithm is at the foundation of their search relevancy
results. For this purpose, three example documents are given below to better explain the Tf-idf algorithm:
Document 1: The game of life is a game of everlasting learning
Document 2: The unexamined life is not worth living
Document 3: Never stop learning
Listing 3.6 – Example documents
Let us imagine that you are doing a search over these documents with the query: life learning. The query
is a free-text query, meaning that its terms are typed free-form into the search interface, without any
connecting search operators. Let us go over each step in detail to see how it all works.

3.1.10.1 Term frequency (TF)
Term Frequency also known as TF measures the number of times a term (word) occurs in a
document. Given below are the terms and their frequency on each of the document.

Table 3.1.2: TF for Document 1

Document 1      the  game  of  life  is  a  everlasting  learning
Term Frequency   1    2    2    1    1  1       1           1

Table 3.1.3: TF for Document 2

Document 2      the  unexamined  life  is  not  worth  living
Term Frequency   1       1        1    1   1      1      1

Table 3.1.4: TF for Document 3

Document 3      never  stop  learning
Term Frequency    1      1      1

In reality each document will be of a different size. In a large document the raw frequency of a term will
be much higher than in a smaller one, hence we need to normalize the term frequency based on the
document's size. A simple trick is to divide the term frequency by the total number of terms. For example,
in Document 1 the term game occurs two times and the total number of terms in the document is 10, so the
normalized term frequency is 2 / 10 = 0.2. Given below are the normalized term frequencies for all the
documents.

Table 3.1.5: Normalized TF for Document 1

Document 1     the  game  of   life  is   a    everlasting  learning
Normalized TF  0.1  0.2   0.2  0.1   0.1  0.1      0.1         0.1

Table 3.1.6: Normalized TF for Document 2

Document 2     the       unexamined  life      is        not       worth     living
Normalized TF  0.142857  0.142857    0.142857  0.142857  0.142857  0.142857  0.142857

Table 3.1.7: Normalized TF for Document 3

Document 3     never     stop      learning
Normalized TF  0.333333  0.333333  0.333333

Given below is the Python code which performs the normalized TF calculation.
def termFrequency(term, document):
    normalizeDocument = document.lower().split()
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))

Listing 3.7 - Python function to calculate normalized TF

3.1.10.2 Inverse document frequency (IDF)


The main purpose of doing a search is to find relevant documents matching the query. In the first step all
terms are considered equally important. In fact, certain terms that occur too frequently have little power in
determining relevance, so we need a way to weigh down the effect of terms that occur too often.
Conversely, terms that occur in fewer documents can be more relevant, so we need a way to weigh up the
effect of less frequently occurring terms. Logarithms help us solve this problem. Let us compute the IDF
for the term game:
IDF(game) = 1 + log_e(Total Number Of Documents / Number Of Documents with the term game in it)

There are 3 documents in all: Document1, Document2, Document3.
The term game appears only in Document1.

IDF(game) = 1 + log_e(3 / 1)
          = 1 + 1.098726209
          = 2.098726209

Listing 3.8 - IDF calculation example


Given below is the IDF for all terms occurring in the documents. Since the terms the, life, is and learning
occur in 2 of the 3 documents, they have a lower score than the terms that appear in only one document.

Table 3.1.8: IDF for all document terms

Terms        IDF
the          1.405507153
game         2.098726209
of           2.098726209
life         1.405507153
is           1.405507153
a            2.098726209
everlasting  2.098726209
learning     1.405507153
unexamined   2.098726209
not          2.098726209
worth        2.098726209
living       2.098726209
never        2.098726209
stop         2.098726209

Given below is the Python code to calculate the IDF:


from math import log

def inverseDocumentFrequency(term, allDocuments):
    numDocumentsWithThisTerm = 0
    for doc in allDocuments:
        if term.lower() in allDocuments[doc].lower().split():
            numDocumentsWithThisTerm = numDocumentsWithThisTerm + 1

    if numDocumentsWithThisTerm > 0:
        return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0

Listing 3.9 - IDF calculation code example in Python


3.1.10.3 TF * IDF
We are trying to find the relevant documents for the query: life learning. For each term in the query we
multiply its normalized term frequency by its IDF in each document. In Document 1, for the term life, the
normalized term frequency is 0.1 and its IDF is 1.405507153; multiplying them we get 0.140550715
(0.1 * 1.405507153). Given below are the TF * IDF calculations for life and learning in all the
documents.

Table 3.1.9: TF * IDF

          Document1    Document2    Document3
life      0.140550715  0.200786736  0
learning  0.140550715  0            0.468502384
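As a small illustrative sketch (assuming the example documents above and the termFrequency and inverseDocumentFrequency functions defined earlier in this chapter), the TF * IDF scores of the query terms can be computed as follows; because the IDF values quoted above are rounded, the printed numbers may differ slightly in the last decimals:

documents = {
    'Document1': 'The game of life is a game of everlasting learning',
    'Document2': 'The unexamined life is not worth living',
    'Document3': 'Never stop learning',
}
query = 'life learning'

# TF * IDF of every query term in every document
for term in query.split():
    idf = inverseDocumentFrequency(term, documents)
    for name in documents:
        score = termFrequency(term, documents[name]) * idf
        print(name, term, round(score, 9))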

3.1.10.4 Vector Space Model – Cosine Similarity
From each document we derive a vector. The set of documents in a collection then is viewed as a set
of vectors in a vector space. Each term will have its own axis. Using the formula given below we can find
out the similarity between any two documents.
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + ... + d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)

Listing 3.10 - Cosine similarity calculation

Figure 3.1: Cosine similarity vector representation


Vectors deal only with numbers, while in this example we are dealing with text documents. This is why
we used TF and IDF to convert text into numbers, so that it can be represented by a vector. The book
Mahout in Action gives a very good explanation of cosine similarity:
“The cosine measure similarity is another similarity metric that depends on envisioning user
preferences as points in space. Hold in mind the image of user preferences as points in an
n-dimensional space. Now imagine two lines from the origin, or point (0,0,…,0), to each of these two
points. When two users are similar, they’ll have similar ratings, and so will be relatively close in
space—at least, they’ll be in roughly the same direction from the origin. The angle formed between
these two lines will be relatively small. In contrast, when the two users are dissimilar, their points
will be distant, and likely in different directions from the origin, forming a wide angle. This angle can
be used as the basis for a similarity metric in the same way that the Euclidean distance was used to
form a similarity metric. In this case, the cosine of the angle leads to a similarity value. If you’re
rusty on trigonometry, all you need to remember to understand this is that the cosine value is always
between –1 and 1: the cosine of a small angle is near 1, and the cosine of a large angle near 180
degrees is close to –1. This is good, because small angles should map to high similarity, near 1, and
large angles should map to near –1.”
The query entered by the user can also be represented as a vector. We will calculate the TF*IDF for the
query:

Table 3.1.10: TF * IDF for the query

          TF   IDF          TF*IDF
life      0.5  1.405507153  0.702753576
learning  0.5  1.405507153  0.702753576

Let us now calculate the cosine similarity of the query and Document1.
Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)

Dot product(Query, Document1)
= (0.702753576 * 0.140550715) + (0.702753576 * 0.140550715)
= 0.197545035151

||Query|| = sqrt((0.702753576)^2 + (0.702753576)^2) = 0.993843638185
||Document1|| = sqrt((0.140550715)^2 + (0.140550715)^2) = 0.198768727354

Cosine Similarity(Query, Document1) = 0.197545035151 / (0.993843638185 * 0.198768727354)
= 0.197545035151 / 0.197545035151 = 1

Listing 3.11 - Cosine similarity for Document 1

Given below are the similarity scores for all the documents against the query:

Table 3.1.11: Cosine similarity scores for all documents

                   Document1  Document2    Document3
Cosine Similarity  1          0.707106781  0.707106781
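These scores can be reproduced with a small sketch of the cosine similarity computation, using the TF*IDF vectors from Tables 3.1.9 and 3.1.10 over the two axes life and learning (the vector values are the ones assumed above):

from math import sqrt

def cosine_similarity(v1, v2):
    dot = sum(a * b for a, b in zip(v1, v2))   # dot product
    norm1 = sqrt(sum(a * a for a in v1))       # ||v1||
    norm2 = sqrt(sum(b * b for b in v2))       # ||v2||
    return dot / (norm1 * norm2)

query = [0.702753576, 0.702753576]   # (life, learning)
doc1 = [0.140550715, 0.140550715]
doc2 = [0.200786736, 0.0]
doc3 = [0.0, 0.468502384]

for name, vec in (('Document1', doc1), ('Document2', doc2), ('Document3', doc3)):
    print(name, round(cosine_similarity(query, vec), 9))
# Document1 -> 1.0, Document2 -> ~0.707106781, Document3 -> ~0.707106781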

Plotted below are the vector values for the query and the documents in the 2-dimensional space of life
and learning. Document1 has the highest score of 1. This is not surprising, as it contains both terms, life
and learning.

Figure 3.2: Cosine similarity results for the query "life learning"

4 System implementation
4.1 Feature implementation
This chapter describes the implementation of the application's features, how these features actually work,
what difficulties were encountered and what workarounds were found. As listed in 2.1.4, the system's
feature requirements were:
• Upload document
• Convert the document format to a single format
• Normalize the input text
• Store document in local database
• Check for similarities against other locally stored documents
• Get a score based on the similarity test
• See a text difference with similar documents
Below, the critical parts of the application implementation are described in more detail.
4.1.1 Document upload
The document upload functionality is pretty straightforward. Once on the main page of the
application, the user is presented with a way to upload a document. The user can select a file and upload it,
as presented in Figure 4.1.

Figure 4.1: Document upload


Only a limited number of document formats are allowed: doc, docx, odt, txt, rtf. If the uploaded file has a
different format, the user will see an error and the upload will be halted (Figure 4.2). Once the user
successfully uploads a file with a valid format, the document is saved on the hard drive, and the next step
(conversion) takes place. The code responsible for storing the file is shown in Listing 4.1.
$tmp_directory = storage_path('docs/' . uniqid());
mkdir($tmp_directory);
Input::file('doc')->move($tmp_directory, Input::file('doc')->getClientOriginalName());

Listing 4.1 – Storing the uploaded file with Laravel

Figure 4.2: File upload error: format not allowed

4.1.2 Document conversion


After uploading the document to the server, the document needs to be converted to HTML format. We
need the document converted to a single format because that way we can build all subsequent steps for
one format only; the cost of converting the document to HTML is far less than implementing
normalization for many different formats. To convert the document to HTML we use the command-line
application from LibreOffice. According to the documentation on the LibreOffice website [9], the
command to convert a document is:
soffice --headless --convert-to output_file_extension[:output_filter_name] [--outdir
output_dir]

Listing 4.2 – Command to convert document with LibreOffice CLI application

In PHP, there is a function called passthru that allows running command-line commands. Therefore, to
convert the uploaded document to HTML, the following snippet of code is run:
passthru('soffice --headless --convert-to html:"HTML" ' . $source . ' --outdir ' .
$output_folder, $retval);

Listing 4.3 – Passthru function executes command in command line

4.1.3 Document text normalization


Normalization is a process that involves transforming characters and sequences of characters into a
formally-defined underlying representation. This process is most important when text needs to be
compared for sorting and searching, but it is also used when storing text to ensure that the text is stored in
a consistent representation. To clean the document content we need to trim unnecessary characters from
the text (such as special characters), remove stop words (words with no special value, such as “a”, “the”,
“in”, etc.), remove HTML tags, and remove table content (since table data is not relevant in this context,
and most of the time it contains numbers and short pieces of information). PHP has a normalization
extension, installable via PECL, that can be used to normalize UTF text (accents, non-Latin characters,
etc.). The normalization performed is FORM_C (Canonical Decomposition followed by Canonical
Composition).

public function get_normalized() {
    // remove unneeded tags
    $tidy_config = array(
        'clean' => true,
        'output-xhtml' => true,
        'show-body-only' => true,
        'wrap' => 0,
    );

    $tidy = tidy_parse_string($this->formatted_txt, $tidy_config, 'UTF8');
    $tidy->cleanRepair();

    $text = strval($tidy);
    $text = preg_replace('~<table[^>]*>(.*?)</table>~is', '', $text);
    $text = preg_replace('~<title[^>]*>(.*?)</title>~is', '', $text);
    $text = strip_tags($text);
    $text = $this->normalizeEntities($text);
    $text = html_entity_decode($text, ENT_COMPAT | ENT_HTML401);
    $text = \Normalizer::normalize($text, \Normalizer::FORM_C);
    $text = preg_replace('~\s~is', ' ', $text);
    $text = trim(preg_replace('~\s\s+~is', ' ', $text));
    $text = strtolower($text);
    $text = $this->remove_stopwords($text);

    return $text;
}

Listing 4.4 – Text normalization

4.1.4 Storing the document


The document storing step is of great importance. The system's chances of finding plagiarized
documents increase as the number of stored documents grows. In order to avoid storing duplicate
documents (a duplicate is a document that is 100% identical to another one), we store the md5 hash of the
normalized text in the database. It would be possible to use the normalized_text field to detect duplicates,
however that field is considerably larger than a hash (which is 32 characters long). Also, an index is
created on the hash field to speed up queries that include this field as a constraint. The document is stored
using Laravel's model methods:
$text = (new Utils\Normalize($converted))->get_normalized();
$hash = md5($text);
$doc = new Doc();
$doc->normalized_text = $text;
$doc->hash = $hash;
$doc->save();

Listing 4.5 – Document storing in the database


4.1.5 Finding out similar documents
Once the document has been stored in the database, the user is redirected to a page where they can see
how similar the document is to other documents that are already stored in the database. As shown in
Figure 4.3, the scores are displayed as percentages: the higher the percentage, the more similar the two
documents are.

Figure 4.3: Document similarity scores

From Figure 4.3 we can notice that Document #10 has a 100% similarity with Document #9; in fact the
real similarity score is 99.99%, but the number was rounded up to 100%. In reality there cannot be two
stored documents with an exact 100% score, since such a document would be a duplicate, and duplicates
are not stored thanks to the hash check. From this page, we can click “Try another search” to return to the
home page and perform a different check.
The actual similarity check is done using the Pysimsearch Python library, which is based upon the Tf-idf
algorithm. This library was originally designed to compute the similarity between web pages, like
this:
$ python similarity.py http://www.stanford.edu/ http://www.berkeley.edu/ http://www.mit.edu/
Comparing files ['http://www.stanford.edu/', 'http://www.berkeley.edu/',
'http://www.mit.edu/']
sim('http://www.stanford.edu/','http://www.berkeley.edu/')=0.322771960247
sim('http://www.stanford.edu/','http://www.mit.edu/')=0.142787018368
sim('http://www.berkeley.edu/','http://www.mit.edu/')=0.248877629741

Listing 4.6 – Pysimsearch usage example


I therefore had to make every document behave like a web page. For that, I created a new route
/doc/text/{document_id}, whose URLs are fed into Pysimsearch as parameters. Thus, with minimal effort,
the Pysimsearch library functionality has been adapted to serve our system's needs.
4.1.6 Text difference visualization

Figure 4.4: Text difference visualization


Text difference visualization permits the user to see the difference between two texts, in the manner
in which Git shows differences; overall, the same text difference algorithm lies at the foundation. For
this, I have used the FineDiff library [10], which automatically builds the HTML output of the difference
test. As can be seen in Figure 4.4, the text in green represents text that is found in both documents in
approximately the same place. The green text is most likely to be plagiarized, unless it has been properly
quoted. The text in red represents text that has been removed from the place where it should have been.
The purple text represents new text that has been inserted in between the matched (green) text. If red text
immediately precedes purple text, this means that someone tried to replace words with their synonyms, or
to paraphrase the text. In this case, the person who checks this text difference should pay more attention
and investigate whether the given document is indeed plagiarized. The following code is used to render
the formatted text difference output:
$doc1_text = convert_html_to_text(file_get_contents(storage_path('docs/' . $doc1_id . '/' .
$doc1_id . '.html')));
$doc2_text = convert_html_to_text(file_get_contents(storage_path('docs/' . $doc2_id . '/' .
$doc2_id . '.html')));
$granularity = new cogpowered\FineDiff\Granularity\Word;
$diff = new cogpowered\FineDiff\Diff($granularity);
return nl2br(preg_replace('~\s{3,}~Uis', '', $diff->render($doc1_text, $doc2_text)));

Listing 4.7 – Text difference rendering

5 Economic Analysis
5.1 Project description
Living in an information age, we tend to protect our intellectual property with laws and moral ethics.
Information theft is a rising problem in modern society, since it comes with several complications: it
violates copyright and discourages originality and uniqueness, which in turn can have an adverse effect on
scientific development. There is an obvious need for ways to prevent the spread of this malignant
phenomenon, which is called plagiarism. The proposed plagiarism detection system comes in very handy
for academic institutions and teachers, helping them find out whether a student's submitted paper is
original and can be graded on that criterion, or otherwise take the required measures to prevent this from
happening again. The system is very simple to use, since it only requires uploading the document in its
original form; the system then gives instant results about similarity with other, already submitted
documents, and highlights a possible plagiarism case.
5.2 Schedule
Since the proposed system requires constant improvement and feature optimization, the schedule is
based on the first iteration of the application. The planning is made up of 3 parts: determining the
objectives; estimating the work time needed and dividing the tasks; and estimating the time required to
implement each of the tasks.
5.2.1 Objectives
Setting the objectives is an important step, since this will keep the team focused on following a plan
and creating a functional initial product release. Objectives are also important to make sure that the team
can keep track of deadlines, and allow enough time for each objective to create every application
component securely and with quality.
5.2.2 Work amount estimate
The time needed to create an initial release of the product can be divided into 3 chunks. The first one is
used for planning the application, setting objectives, and building UML diagrams of the internal
components and interactions. We should allow plenty of time for this step, because we want to stay
focused on following our objectives and avoid veering away from the initial goals. The second amount of
time is dedicated to development. During this phase, developers implement the functionality of the system
into runnable software, which testers can then use to make sure that the quality standards are met. The
third amount of time is assigned to deployment tasks and server setup.
5.2.3 Schedule
To correctly estimate the amount of time required to develop the product, it is necessary to identify
the bigger parts of the tasks needed for implementation, and to split these tasks into smaller activities that
can be evaluated and managed more easily. Thanks to this, it will be easier to define the logical steps for
project implementation. The duration of development can be represented by the following formula:
Duration = Finish date – Start date + reserve time
Where the time between the start date and the finish date is the time when the activities are carried out,
and the reserve time is a buffer added to make up for any unplanned circumstances beyond the planned
time.
The following table describes the schedule for the development of the proposed system, with the
following notations: PM – project manager, SD – software developer, SA – system architect, PO –
product owner, PC – personal computer, EA – enterprise architect, IDE – integrated development
environment, T – tester, D – designer.

Table 5.2.1: Project schedule

Nr.  Activity name           Duration, days  Worker      Who approves the activity  Resources used
1    Define concept of idea  14              PM, SD, SA  PM, PO                     Internet, PC, office requisites
2    UML and DB modeling     14              SA          PM, PO                     Internet, PC, books, EA
3    Choose technologies     7               SD, SA      PM                         Internet, PC
4    Business logic          30              SD          PM                         Internet, PC, IDE, books
5    Design                  30              D, SD, T    PM                         Internet, PC, IDE, books
6    Testing and fixing      20              T           PM                         Internet, PC, IDE, books
7    Deployment              3               SD, SA      PM, PO                     Internet, PC, IDE
8    Reports, follow-ups     5               PM          PO                         MS Word, LaTeX

Hence, according to the schedule presented earlier, the expected time of delivery is 123 days, plus 21 days
reserve time. Given this table, we can count how much time is required by every worker to do their work.
Note: the days described here are actually working days, 1 day = 8 hours.

• Project manager – 118 days
• System architect – 38 days
• Software developer – 84 days
• Designer – 30 days
• Tester – 50 days
5.3 Economic proof
It is relevant to provide economic justification for IT projects, given the specifics of competition in
economic relationships, which assumes a wide research space. Under conditions where the marketing
environment has a low degree of determination, prices are highly volatile and the depth of forecasts is
reduced, a common business plan does not allow exact foreseeing of the final results of the business. In
this context, one of the basic instruments is choosing the right methods, positions and parameters for the
economic justification. Achieving this goal requires a large amount of scientific research, subordinated to
the primary goal and formulated by means of the following objectives:

a) studying the theoretical and methodological aspects of business planning in competitive market
conditions;
b) systematization, determining the methodology and specifying the indices for the economic
justification of business plans in IT;
c) studying and analyzing the actual practice of economic justification of business plans for IT in the
Republic of Moldova;
d) developing methodological concepts for justifying investment decisions under conditions of risk
and uncertainty;
e) studying the evaluation criteria of business projects' efficiency and elaborating a mechanism for
their complex evaluation.

5.3.1 Long term material and non-material expenses


Direct material costs are the costs of direct materials that can be easily identified per unit of
production. For example, the cost of glass is a direct material cost in light bulb manufacturing. The
manufacture of products or goods requires materials as the prime element. In general, these materials are
divided into two categories: direct materials and indirect materials. Direct materials are also called
productive materials, raw materials, raw stock, stores, or simply materials, without any descriptive title.
For a developer, the direct materials are the working computer and other devices such as flash drives,
CD-Rs and others. Tables 5.3.1 and 5.3.2 below present the material and non-material expenses that arise
during the project's development.

Table 5.3.1: Long term material expenses

Nr  Name     Price (MDL)  Quantity  Sum (MDL)
1   PC       10500        5         52500
2   Printer  1500         1         1500
3   Server   10000        1         10000
Total                               64.000
Table 5.3.2 lists the non-material expenses: the tools/software used in the development process.

Table 5.3.2: Long term non-material expenses

Nr  Name                  Measure  Price  Quantity  Sum (MDL)
1   Linux distro          MDL      0      2         0
2   Gimp                  MDL      0      1         0
3   Enterprise Architect  MDL      2587   1         2587
Total                                               2.587

Table 5.3.3 includes the direct expenses: the office supplies and consumables used during the project
development.

Table 5.3.3: Direct material costs

Nr  Name                       Unit price (lei)  Quantity  Sum (lei)
1   Office paper (500 sheets)  47                2         94
2   Printer ink                113               2         226
3   Pen                        10                15        150
4   Marker                     15                6         90
5   Office board               475               1         475
Total                                                      1.035

The total of the material and non-material expenses for the development of this project is 67.622 MDL.
5.3.2 Salary expenses
The salaries of the workers are calculated according to table 5.3.4, which shows the amount of work
(time) done and the pay rate for each kind of job.

Table 5.3.4: Salary expenses

Nr  Position          Number of employees  Number of days  Salary (MDL/day)  Salary, MDL
1   System Architect  1                    38              720               27360
2   Project Manager   1                    118             400               47200
3   Programmer        2                    84              640               107520
4   Designer          1                    30              560               16800
5   Tester            1                    50              500               25000
6   Cleaning lady     1                    40              100               4000
Total pay for all workers                                                    227880
Social fund (23%)                                                            52412,4
Medical assurance (4%)                                                       9115,2
Total work remuneration                                                      289407,6
Table 5.3.4 represents all the resources spent on employees and taxes. The formulas are:
Frm = 27360 + 47200 + 107520 + 16800 + 25000 + 4000 = 227880 (lei) (5.1)
Where Frm is the work remuneration fund (“Fondul de Retribuire a Muncii”), on the basis of which FS is
calculated:

FS = Frm * Cfs (5.2)
Where FS is the sum of the contributions to the Social Fund (Fondul Social) and Cfs is the contribution
quota for the state mandatory social insurance, approved each year by the State Budget Law (in 2014 – 23%)
FS = 227880 * 0.23 = 52412,4 (5.3)
MA = Frm * Cam (5.4)
Where MA is the medical assurance contribution and Cam is the medical assurance quota approved each
year by the State Budget Law for mandatory medical insurance (in 2014 – 4%)
MA = 227880 * 0.04 = 9115,2 (lei) (5.5)
The sum Frm + FS + MA is the total expense for work remuneration:
Total = 227880 + 52412,4 + 9115,2 = 289407,6 (lei) (5.6)
The retirement fund consists of 6% of the gross income and is calculated by the following formula:
Retirement fund = Gross income * 0.06 = 227880 * 0.06 = 13.672,8 (lei) (5.7)
5.3.3 Indirect expenses
The wear of fixed assets is the partial loss of the consumable properties and value of the assets during
their usage, influenced by different factors and by the increase of work productivity. For computing
electricity usage we take into account that a PC uses 400 W per working hour (8 hours per day * 123
usage days = 984 hours). Therefore, for the 5 PCs we have:
Total power usage = (5 * 400 * 984) / 1000 = 1968 kWh (5.8)

Table 5.3.5: Indirect expenses

Nr  Name             Measure  Quantity  Tariff (lei)  Sum, MDL
1   Power usage      kWh      1968      1.58          3.109,44
2   Internet         month    6         175           1050
3   Office rent      month    6         2500          15000
4   Office meals     month    6         175           1050
5   Repair services  units    3         433           1299
6   Client meetings  month    6         200           1200
Total                                                 22.708,44

5.3.4 Wear calculation
Table 5.3.6 shows the wear of the equipment used for this project. The equipment listed in the table
partially loses its consumable properties and value during usage, influenced by different factors and by
the increase of work productivity. The wear is computed as:
FA = (V * T) / T1 (5.9)
Where V is the initial value of the asset, T1 is the useful usage time of the asset, and T is the actual time
the asset is used during project development.

Table 5.3.6: Material wear cost

Long term material asset  Initial value, MDL  Useful usage time, months  Wear expenses, MDL
PC                        52500               36                         8750
Printer                   1500                36                         250
Server                    10000               36                         1666,66
Total                                                                    10.666,66

5.4 Project cost


The total cost of the project can be computed by adding together all the expenses, as illustrated in
table 5.4.1.

Table 5.4.1: Project expenses

Computing article              Value, MDL  %
Material expenses (wear cost)  10.666,66   3.30
Salary expenses                227880      70.60
Social fund                    52412,4     16.24
Medical assurance              9115,2      2.82
Indirect expenses              22.708,44   7.03
Total                          322.782,7   100

5.5 Financial results
The plagiarism detection system project is expected to have a gross turnover of 450.000 MDL (CAb).
The net turnover is computed by removing the 20% VAT (TVA) from the gross turnover:
CAn = CAb – TVA = 450000 – 20% = 360000 MDL (5.10)
The gross profit can be calculated with formula 5.11, where PC is the total project cost:
Pb = CAn – PC = 360000 – 322.782,7 = 37.217,3 MDL (5.11)
Now that the gross profit is determined, we can compute the net profit. To get the net profit we have to
subtract a certain percentage from the gross profit, which depends on the type of entity (a natural person
or a legal entity). In our case the company “NoPlagiat” SRL is a legal entity, and the percentage to
subtract is 12%. Thus the net profit is:
Pn = 37.217,3 – 37.217,3 * 0.12 = 32.751,23 MDL (5.12)
The profitability indicators can be computed to assess the product's success on the market:
Sales profitability = Pn / CAb * 100% = (32.751,23 / 450000) * 100 = 7.28 % (5.13)
Economic profitability = Pn / PC * 100% = (32.751,23 / 322.782,7) * 100 = 10.14 % (5.14)
5.6 Conclusion
The plagiarism detection system does not have a high profitability percentage, probably because it is
intended for use by academic institutions that cannot afford to pay a lot for expensive software. The
project might receive extra funding from the government to support the good cause it brings to society.
Once the system has a stable platform, the functionality can be sold to institutions in other countries in
the form of Software as a Service, which means they will pay for using our plagiarism detection service
(we do not sell the software itself, nor any source code).

Conclusions
After developing and testing the plagiarism detection system, it was found that the proposed objectives
and requirements have been met. The system will help people, and mostly academic institutions, to
combat the plagiarism phenomenon and encourage fair play. The system allows uploading a document
and checking it for similarity with other stored documents. As more documents are added to the database,
the chances of detecting plagiarism get higher and the system becomes more effective. The user interface
is very simple and easy to use, and does not require extra investment in software training and tutoring.
The system's functionality is offered in the form of Software as a Service, does not require any software
to be installed on the user's computer (except for an Internet browser), and can be used remotely.
The system was built using modern technologies that are, for the most part, open-source and do not
require a commercial license. A few can be enumerated: PHP 5.3 was used as the main programming
language of the application; since PHP is a popular programming language, the software can be developed
and maintained more easily. LibreOffice has been used to convert documents from different file formats to
a single format, HTML, the conversion being made through its command line executable. The Laravel
framework has been used as the skeleton of the application, and also for keeping good coding practices,
since Laravel enforces some well-established coding conventions. The Pysimsearch library has been used
to compute the similarity between two full texts. The algorithm that checks whether two texts are similar
is called term frequency – inverse document frequency, and of all tested algorithms and solutions it proved
the fastest and most efficient. MySQL is used as the database and Apache 2 as the web server. By
inter-connecting all these technologies and libraries it was possible to build a functional plagiarism
detection system that works and produces satisfactory results.
The created system is easy to maintain, and can be extended to offer more functionality and improved
efficiency. To obtain better results, the similarity checking algorithm needs to be improved. Future plans
for this system are: to create a user management system per institution, since an academic institution
might need several accounts for its accreditation committee members; to give the user interface a cosmetic
revamp to simplify interaction with the system; and, as the database of documents gets bigger and the time
required to perform a check grows linearly, to implement a job queue and move the similarity check to a
background operation, in order to avoid the server crashing. To keep the development of this system
active, there should be a yearly fee for the academic institutions that use the plagiarism detection system
the most. Also, support for the development of this system could be obtained from state grants, since this
software can have a beneficial effect on society.

Bibliography
1. http://www.uq.edu.au/myadvisor/academic-integrity-and-plagiarism, 2012
2. Peter Joy and Kevin McMunigal, “The Problems of Plagiarism as an Ethics Offense”, 2011
3. L. A. Jensen, J. J. Arnett, S. S. Feldman and E. Cauffman, “It’s wrong, but everybody does it:
academic dishonesty among high school students”, Contemporary Educational Psychology, 27(2),
209-228
4. Liuba Lupasco, Timpul, “TEHNOLOGII: Tezele ASEM-iştilor trecute prin detectorul de plagiat”,
June 18, 2012
5. http://turnitin.com/en_us/features/faqs
6. R. N. Taylor, N. Medvidovic, E. M. Dashofy, Software Architecture: Foundations, Theory, and
Practice, January 9, 2009
7. http://webuzo.com/sysapps/version_control/Git, March 12, 2014
8. https://github.com/taherh/pysimsearch, by taherh
9. http://ask.libreoffice.org/en/question/2641/convert-to-command-line-parameter/, May 12, 2012
10. http://demo.icu-project.org/icu-bin/nbrowser, online tool to test UTF text normalization
11. https://github.com/cogpowered/FineDiff, the text difference library
