With the advent of powerful search engines that can provide results semantically relevant to the user's
queries, finding and reusing any piece of information has become extremely easy. This has become a major
issue, since students, writers, and even academics use chunks of information in their works without
providing proper citation of the original source. This poses a major risk, since plagiarism is considered a
copyright infringement, and hence fraud.
This work presents research on building a plagiarism detection system. Multiple methods are
described, along with their advantages, disadvantages, and the technologies used. The emphasis is placed on
the plagiarism detection system itself, rather than on the underlying algorithm.
This research aims to show how a plagiarism detection system can be built using well-known
technologies, combining them to determine whether a document is plagiarized or unique. The
system relies on a database of normalized documents against which comparisons are made. This
software could be used in the education system to detect whether papers are unique, encouraging new
ideas and discouraging content copying and idea theft.
Rezumat
With the development of high-performance search engines, which return semantically relevant results for
users' queries, finding and using information has become extremely simple. This has become a major
problem, because students, writers, or even academics use chunks of information in their works without
adding proper citation of the original source. This raises a major issue, because plagiarism is considered a
copyright infringement and is therefore considered fraud.
This work intends to present research into building a plagiarism detection system. Several methods are
presented, with their advantages and disadvantages, and the technologies used. The emphasis is on
building the plagiarism detection system, not on researching the main algorithm.
This research aims to show how a plagiarism detection system can be developed using well-known
technologies, combining them to detect whether a document is considered plagiarized or unique. This
system relies on a database of normalized documents against which comparisons are made. This software
could be used in the education system to detect whether students' papers are unique, to encourage new
ideas, and at the same time to discourage the plagiarism of others' work.
Introduction
Over the past few years, information resources have more than tripled in size thanks to the advent of
the World Wide Web. It is now extremely easy to find a variety of information on different subjects, from
different authors, and even the same information from different sources. While this has its positive
aspects (such as information freedom), there are also negative ones. Much of the publicly available
information was meant to serve as personal cognitive material, not to be used in personal work
where a claim of uniqueness is made.
“Plagiarism is the practice of taking someone else’s work or ideas and passing them off as one’s own.” -
New Oxford American Dictionary
As mentioned in an article from the University of Southern Queensland [1], plagiarism can have
different causes. Some of them are misusing resources without providing proper citation; copying and
paraphrasing blocks of text; or even abusing the limits allowed on group work. Generally speaking,
plagiarism as a term is not applicable only to textual documents. Plagiarism can also be found in the
software industry, when code is used without proper source identification and credit to the
original author, and in the graphic industry, where elements like icons, colors, or text used in a
graphic product can be considered plagiarized if they closely resemble an earlier
work. This research will focus on text plagiarism only.
Peter Joy and Kevin McMunigal [2] have reflected on the problem of plagiarism in the academic
environment. According to them, a major grading criterion for a work is its originality, and when someone
submits for grading a paper with ideas belonging to other authors, that can be considered a “cardinal
sin”. A paper needs to represent the personal findings and ideas of its author, since these represent the
individuality of a person. Copying someone else's idea and claiming it as your own is like trying to
impersonate someone, which is unethical and unlawful.
There are several ways to combat the plagiarism phenomenon: to prevent it, or to detect it and take
the appropriate measures. Preventing plagiarism has more to do with measures taken by teachers in
academic institutions, who should make sure that students are aware that plagiarism is a bad habit and
that they can be punished for it. However, preventing something from happening is very difficult, and it is
impossible to guarantee that people conform to the rules. If the process cannot be controlled, the end
product can easily be. For that reason, plagiarism detection is an effective method of making sure that a
work is unique and reflects the person who created it.
1 Domain analysis
With the rapid growth of the World Wide Web, more and more information has become freely and
publicly available to any knowledge seeker. Unfortunately, this has made the plagiarism phenomenon
pose even higher risks in different environments. One of the most affected environments is academic
institutions.
Education is a double-sided process: on one side the student does research and learns from it, and on the
other side the teacher grades this knowledge. When a student obtains ready-made information for an
assignment and uses it without diving into its essentials, he does not learn anything new and does
not develop critical thinking. On the other side, the teacher's assessment turns out to be
erroneous, and does not reflect the real level of education. This can have serious long-term
implications, since generations come and go, and upcoming students are taught by those who used to
practice the bad habit of plagiarizing content. This defies the purpose of humanity: constant evolution.
According to a study of American secondary schools, when the different types of plagiarism are summed
up, the rate reaches 90% [3]. There is also a more recent study and an attempt to solve the plagiarism problem in the
Republic of Moldova [4]. In a preliminary test with a plagiarism detection tool, there were
cases where research papers matched with a 100% similarity percentage, while others raised an
alarm at 40-50% similarity. As of this writing, the software seems not to have been adopted by the
academic institution in question (ASEM).
Unfortunately, every year the number of research papers that contain plagiarized content grows
considerably. Teachers cannot cope with detecting plagiarized content manually, since the volume of papers is
large and the sources of information are constantly growing; it is therefore impossible for a human to keep
track of all papers and single out the ones that are plagiarized. There is definitely a need for an
automatic detection system that can both grow its database of documents and easily detect the
similarity between them.
There is a series of available tools for plagiarism detection. Among them, one distinct
service, called Turnitin [5], is the leader in plagiarism detection services. The mechanisms behind this
service are kept secret, and its authors barely offer any insight into how the system works behind the scenes. One
thing to note about this system is that it stores student works in its database, so that they can be used
for future testing. According to its statistics, more than 337 million student works have been added to
its database, mostly by academic institutions, which clearly serves as a reminder that the
education system needs a plagiarism detection tool to cope with duplicated content among student works.
Below are some well-known plagiarism detection systems, compared by features:
Table 1.1: Existing solutions comparison
System name                       | Searches the Internet? | Searches in local database? | Performs text normalization? | Provides detailed information?
----------------------------------+------------------------+-----------------------------+------------------------------+-------------------------------
Advego Plagiatus                  | Yes                    | No                          | No                           | Yes
Istio                             | No                     | Yes                         | No                           | Partially
Miratools                         | Yes                    | No                          | Yes                          | Yes
Plagiat-inform                    | Yes                    | Yes                         | No                           | Yes
Praide Unique Content Analyzer II | Yes                    | No                          | No                           | Yes
Proposed solution                 | No                     | Yes                         | Yes                          | Yes
By examining Table 1.1, we can notice that most plagiarism detection systems rely on searching the
Internet for duplicates and similar content. Most of the time they use common search engines
(Google, Bing, Yahoo! Search) to perform these searches. The main drawback of this approach is that
search engines are meant to search for keywords, not chunks of full text. Moreover, most of them
protect against repeated similar requests by either temporarily blocking access to their services or by
enforcing a captcha test. Therefore, such systems also need to implement sophisticated proxy connections,
which turn out to be unreliable and not scalable.
2 System architecture
2.1 System vision
2.1.1 Problem statement
It is easy to see that the rate at which plagiarism cases arise is growing at an alarming
pace. According to an article from the Open Education Database (December 19, 2010), an informal poll
from 2007 revealed that 60.8% of polled college students admitted to cheating, and 16.5% did not regret
it. The disturbing news is that, according to the article, cheaters received higher marks than those
who study honestly. This is a bad sign, since it can have a discouraging effect on those students who
do their work honestly, and also because grades are kept artificially high by people who did not deserve
them. Moreover, it is hardly possible to manually detect whether a work is plagiarized, due to the high number
of students and submitted works. The need for an automatic plagiarism detection system is apparent: it
can easily search across massive amounts of information to tell whether a document is plagiarized. The
beneficial effect of a successful plagiarism detection tool is that it will hopefully discourage students from
practicing this bad habit and encourage them to create unique, quality content.
2.1.2 System objectives
• Implement an algorithm that can identify whether a document is considered plagiarized, based on
a similarity percentage. The “plagiarized” status will be given to a document, based on
experimental similarity percentage findings.
• Have the ability to check a document against a database of existing documents.
• Show similarities between two documents, to better understand what words, phrases, or
paragraphs have been plagiarized.
• Have a good performance index. The system needs to perform well with thousands of documents
stored in the database.
The system will allow the user to determine whether a document is duplicated. This decision is made
by checking the document's content against a series of locally stored documents, using a fast
text similarity algorithm. If the similarity index is high, one of the two
documents might contain plagiarized content. The person using the system will then have the ability to
see which portions of the document are found in the other document. The technique used for this is called
text-diff, and it makes it possible to see where different words have been used in place of the original ones to trick the
plagiarism detection software.
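As a rough illustration of the text-diff technique, the word-level comparison can be sketched with Python's standard difflib module. This is only an illustrative sketch; the function name and output format are invented here, not taken from the system's actual implementation:

```python
# Illustrative word-level text-diff using Python's standard difflib.
from difflib import SequenceMatcher

def word_diff(original: str, suspect: str):
    """Return (operation, original words, suspect words) tuples."""
    a, b = original.split(), suspect.split()
    ops = []
    for op, i1, i2, j1, j2 in SequenceMatcher(a=a, b=b).get_opcodes():
        ops.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return ops

# "equal" spans would be rendered in green, "replace" spans in red/purple:
for op, old, new in word_diff("the quick brown fox jumps",
                              "the fast brown fox jumps high"):
    print(f"{op:8s} {old!r} -> {new!r}")
```

A "replace" operation is exactly the case described above: a word swapped for a synonym to evade detection.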
2.1.3 Identifying the stakeholders
The stakeholders represent all parties that can affect the development of this system. By analyzing
the people who are going to use this system, it can be concluded that there are three parties. The first
of them is the users, who in our case could be teachers, academic institutions, students, or whoever
needs to check a specific document for plagiarism. This party will be using the system actively, and it
is they who can provide additional feedback for continuous system improvement. The second party is
the authors, or the people whose works are checked for plagiarism. These people do not use the system
directly (unless they need to check their works for uniqueness for personal reasons), but they provide a
substantial contribution to the whole plagiarism detection system. Their contribution lies
in their works, which are added to the local database. The bigger this database gets, the better the results the
system can provide, and the easier it is to spot plagiarized documents. It should be noted that the efficiency of the
system described in this paper relies highly on its arsenal of stored information. Finally, the third
identified party is the developers who maintain the system. It is their task to continuously research and
improve the algorithm used for plagiarism detection, so that the stored information can be analyzed in
the best way.
2.1.4 Documenting the functional requirements of the system
The functional requirements of a system describe the functionality and components that must be
present in the application. By analyzing the system objectives, several requirements can be
identified:
• Upload document
• Convert the document format to a single format
• Normalize the input text
• Store document in local database
• Check for similarities against other locally stored documents
• Get a score based on the similarity test
• See a text difference with similar documents
The first requirement of this system is to allow the person who uses it to upload a document into the
system. Once the document has been uploaded, it needs to be converted to a specific format (text/html in
our case). Without this step, it would be impossible (or hardly possible) to perform further operations on
the text. Next, the system needs to process the content, specifically the text. Several
normalization techniques are applied to clean out unnecessary characters (such as non-Latin characters),
images, tables, or other content that is hard to parse and analyze; stop words are also removed (stop
words are words defined as too general to take into consideration for uniqueness, such as the function words
"at", "this", "the", "not", etc.). Once the text has been carefully normalized, we need to store it in the
database, so we can avoid performing the aforementioned steps again when we need to check for
similarity with other documents. Another requirement is the similarity test; this test makes use of the text
analysis algorithm to find out how similar two documents are. Based on this similarity test, we can get a
similarity score (or index) as a percentage. If the score is 100%, the two documents are
duplicates; the lower the score, the more unique and original the two documents are.
The final requirement on our list is the text difference feature, which allows the person
performing the test to see the actual textual similarity between the two documents. This is presented in a
text-diff manner, showing in green the text that is present in the other document, and in red/purple the
words/phrases that differ between the two documents. Based on this text-difference test, a person
can see whether someone tried to replace words with synonyms in order to trick the system into believing that their
work is unique and original.
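The normalization step described above can be sketched as follows. This is a minimal sketch: the stop-word list is a tiny illustrative subset, not the one used by the system, and the regular expression is one possible way to drop non-Latin characters:

```python
# Minimal text normalization sketch: lower-case, strip non-Latin
# characters, collapse whitespace, and drop stop words.
import re

STOP_WORDS = {"at", "this", "the", "not", "is", "a", "an", "of", "and"}

def normalize(text: str) -> str:
    text = text.lower()
    # keep only Latin letters, digits, and whitespace
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = [w for w in text.split() if w not in STOP_WORDS]
    return " ".join(words)

print(normalize("This is the Original text, at 100%!"))  # -> original text 100
```

The normalized string, not the raw upload, is what gets stored and compared, which is why the step runs once at upload time.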
2.1.5 Documenting the non-functional requirement of the system
Non-functional requirements do not describe specific functionality, but are
important characteristics of the system that can have a positive or negative impact on its
quality. By analyzing today's standards, several requirements have been identified:
• Performance
• Usability
• Reliability
• Modifiability
• Maintainability
• Accuracy and precision
From the proposed list of non-functional requirements, two are of higher importance for this
type of system: the performance and accuracy requirements. It is very important that the system
can provide fast results, considering that the number of documents added to the database is constantly
growing and the time needed to perform a plagiarism check also grows linearly. Usually, the algorithms that
provide the best results perform worse, so a balance between performance and reliability needs to be
found. The other important requirement is accuracy. It is mandatory that the system be accurate
in its results, since the number of documents is high, and receiving a large set of irrelevant results can
frustrate users and cause them to stop using the system. The system also needs to be maintainable and modifiable,
since development on it is continuous by nature and should remain straightforward.
2.2 Architectural representation
2.2.1 Use Case Diagrams
To represent the system components, the interaction between them, and the parties that will interact with
the system's functionality, UML diagrams will be used for simplicity and convenience. Figure 2.1 shows the
system's use case diagram, which depicts the system's features and functionality. A user can
perform several actions with the plagiarism detection system, such as uploading a document to be stored
in the database. Before this document is stored in the database, it is converted to a
predefined format and its content is normalized. Next, the user can choose to check
whether a document is plagiarized; they can select one of the defined algorithms: normalized
compression distance or term frequency - inverse document frequency (TF-IDF). They can also see the
text difference of two stored documents, which will be shown in a manner similar to how Git works,
representing the content that is common to both documents, or showing where terms have been replaced with
other terms (presumably synonyms).
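Both algorithms can be sketched in a few lines. The forms below are the textbook versions (normalized compression distance computed with zlib, and TF-IDF with a smoothed idf compared by cosine similarity); the exact variants and parameters used by the system may differ:

```python
# Textbook sketches of the two similarity measures; the compressor choice
# (zlib) and the smoothed idf are assumptions made for this illustration.
import math
import zlib
from collections import Counter

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: ~0 for identical inputs."""
    cx, cy = len(zlib.compress(x)), len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def tfidf_vectors(docs):
    """TF-IDF vectors (term -> weight dicts) for tokenized documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return [{t: tf * (1 + math.log(n / df[t]))
             for t, tf in Counter(doc).items()} for doc in docs]

def cosine(u, v) -> float:
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["plagiarism", "detection", "system"],
        ["plagiarism", "detection", "tool"],
        ["cooking", "pasta", "recipe"]]
v = tfidf_vectors(docs)
print(cosine(v[0], v[1]), cosine(v[0], v[2]))  # similar pair scores higher
```

Note the opposite orientations: NCD is a distance (lower means more similar), while cosine similarity over TF-IDF vectors is a similarity (higher means more similar), so the two need to be mapped consistently onto the system's percentage score.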
Figure 2.2: Actors interacting with the system and actions they can perform
The end user can be anyone: ordinary users, teachers, students, or academic institutions that need to
check a document for uniqueness. The end user can upload a new document into the system's database
(unless an exact copy is already present, in which case the system will show that the uploaded document is a duplicate), find
other documents that are similar in content to their document, and see the text difference of two similar
documents.
2.2.2 State Diagram
State diagrams (also called statechart diagrams) are used to better understand any complex or unusual
functionality or business flows of specialized areas of the system. State diagrams depict the dynamic
behavior of the entire system, a sub-system, or even a single object in the system. After examining the
proposed system and the states of user actions in the system, the diagram in figure 2.3 has been
elaborated. Before performing any action on the system, the user needs to authenticate, that is, log into the
system. Then the user can choose from several ways to interact with the system:
they can upload a new document to the database, delete their existing documents, or see the text difference
between two documents. After they have finished interacting with the system, they can choose to log out
of the system.
Figure 2.3: System's state diagram
2.2.3 Chosen architectural patterns
Architectural patterns are a way of solving recurring architectural problems. They promote code
reuse and follow conventions that encourage the habit of writing good code. As described by R. N. Taylor
[4]:
“An Architectural Pattern is a named collection of architectural design decisions that are applicable
to a recurring design problem parameterized to account for different software development contexts
in which that problem appears.”
The benefit of using architectural patterns is that they provide a common language for developers. This
allows more time to be spent on extending the application with features, rather than on discussing how
some functionality should best be built. Every architectural pattern makes it possible to achieve a specific
global system property, such as the adaptability of the user interface. The most commonly used
architectural patterns are described below:
• The Client/Server pattern segregates the system into two applications, where the client makes
requests to the server. In many cases, the server is a database with application logic represented as
stored procedures.
• The Component-Based Architecture pattern decomposes application design into reusable functional
or logical components that expose well-defined communication interfaces.
• The Domain-Driven Design pattern is an object-oriented architectural style focused on modeling a
business domain and defining business objects based on entities within that domain.
• The Layered Architecture pattern partitions the concerns of the application into stacked groups
(layers).
• The Message Bus pattern prescribes the use of a software system that can receive and send messages
over one or more communication channels, so that applications can interact without needing to know
specific details about each other.
• The N-Tier / 3-Tier pattern segregates functionality into separate segments in much the same way as
the layered style, but with each segment being a tier located on a physically separate computer.
• The Object-Oriented pattern is a design paradigm based on the division of responsibilities for an
application or system into individual, reusable, self-sufficient objects, each containing the data and
behavior relevant to the object.
• Service-Oriented Architecture (SOA) refers to applications that expose and consume functionality
as services using contracts and messages.
Most of the time, the architecture of an application is a combination of several architectural patterns that
make up the complete system.
Analyzing the proposed system and choosing from the best suitable architectural patterns, the
multi-layered architectural pattern has been chosen. This architectural pattern uses several layers to
separate the different responsibilities inside the software system. The framework upon which this system
is built on has the MVC pattern at its core which adheres to the layered architectural pattern.
MVC stands for Model-View-Controller. MVC separates domain / application / business logic from the
rest of the user interface. It does this by separating the application into three parts: the model, the view,
and the controller. The model manages fundamental behaviors and data of the application. It can respond
to requests for information, respond to instructions to change the state of its information, and even to
notify observers in event-driven systems when information changes. This could be a database, or any
number of data structures or storage systems, but is not limited to just storage systems. The view
effectively provides the user interface element of the application. It will render the data from the model
into a form that is suitable for the user interface. The controller performs the logic part of the application;
it receives user input and makes calls to model objects and the view to perform the appropriate actions.
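The separation above can be shown with a deliberately tiny sketch. The real system uses a web framework with MVC at its core; the class and method names below are invented for this illustration only:

```python
# Toy MVC sketch: the model holds data and notifies observers, the view
# renders model data, and the controller mediates user input.
class Model:
    def __init__(self):
        self.docs = {}
        self.observers = []

    def add(self, doc_id, text):
        self.docs[doc_id] = text
        for callback in self.observers:  # notify observers of the change
            callback(doc_id)

class View:
    def render(self, doc_id, text):
        return f"Document {doc_id}: {text}"

class Controller:
    def __init__(self, model, view):
        self.model, self.view = model, view

    def upload(self, doc_id, text):      # handle user input
        self.model.add(doc_id, text)
        return self.view.render(doc_id, text)

app = Controller(Model(), View())
print(app.upload(1, "normalized text"))  # -> Document 1: normalized text
```

Note that the view never touches the model's storage directly, and the model knows nothing about rendering; only the controller connects the two.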
Architectural principles guide the design of a system (business system, information system,
infrastructural system). They set the rules for design decisions, translating business criteria
into language and specifications. Applying them requires a framework of appropriate
policies and procedures to support their implementation. The architectural principles respected by this
application are:
• High cohesion;
• Loose coupling;
• Separation of concerns;
• Information hiding;
• Liskov Substitution;
• Inversion of control;
• Interface segregation;
• Modularity;
• Design for change;
• Convention over configuration.
Well-defined responsibility boundaries for each layer, and the assurance that each layer contains
functionality directly related to its tasks, help maximize cohesion within each layer.
Communication between the layers of the system is based on abstraction, which provides loose coupling
between the layers. Each layer has predefined tasks and concerns, and the Separated Presentation pattern
divides UI processing concerns into distinct roles. Since the layers have specific responsibilities, no
assumptions need to be made about data types, methods, properties, or implementation during design,
as these features are not exposed at layer boundaries, thus assuring information hiding. Liskov
substitution is supported because the separation between the functionalities of each layer is clear, and lower layers have
no dependencies on higher layers, which makes replacing a layer quite possible. Because each layer
exposes a single interface to the functionality it provides, the interface segregation
principle is also fulfilled by the system's architecture. The modularity of the system is achieved because
each layer is a module, and the layers interact with each other through well-defined
interfaces. Since everything is divided into modules/layers, it is easier to add new functionality (without
affecting the behavior of the entire application), thus ensuring that the system is designed for change. The
principle of convention over configuration is guaranteed by using naming conventions between the data layer and the
logic layer.
2.2.4 Architectural sketch
Figure 2.4 represents the architectural pattern mentioned earlier as the one used in our system.
On the top level is the presentational layer, which is responsible for presenting information to the
user. The presentational layer provides a user interface through which the user interacts with our
application. It is responsible for templating, result caching, and returning results in the correct format based on
the request made. It is also how our system interacts back with the user (for example, returning
results or asking for more user input). This layer does not contain any business logic, but only presents
the data sent from the lower layer, which is the business layer. The business layer coordinates the
application and routes user requests to controllers, which in turn process the request and return the result
to the presentational layer. The business layer lies between the presentational layer and the data access
layer, and interacts more with the latter. The last layer in our list is the data access layer, where
information is stored in and fetched from the storage medium (a database for this system). The only layer
that has direct, unabstracted access to the information is the data access layer. It is also here that
data is cached to speed up similar queries over short periods of time.
field for a document in our system, because it contains the information we use to find out whether a
document is plagiarized. The "created_at" and "updated_at" fields are timestamps that hold the
creation time and the time of the last change to the document's state. The "hash" field, of type
string, represents the md5 hash of the "normalized_text" field, and is used to avoid storing a duplicate
document twice. The hash field makes it possible to check much faster whether the same document is already stored in the database,
since it is indexed and is much shorter than the "normalized_text" field. The methods of
the "Document" class are actually inherited from the base class "Eloquent" and will be described later.
The "User" class represents the actor that actually interacts
with our system. The "id" field is used for the same purpose as described earlier. The "name" field
represents the name of the user and is only used to visually identify the currently logged-in user in the
system. The "role" field describes the role of the user in the system. A simple user can only
use publicly open system functionality and has a limited set of actions that can modify data. On the other
hand, the administrator role has broader privileges, and can add new users, delete users,
manage the documents of any user, and perform chore tasks such as cache pruning, database optimization, and
rebuilding the normalized_text field (in case we improve the normalization algorithms). The "Eloquent"
class is a base class that serves as a starting ground for most of the classes in our system. It contains the
common properties and methods used to interact with the data layer. Some of the important methods of this
class are "find", which fetches an object from the database based on its ID; "delete", which
deletes an object from the database; and "where", which finds objects based on a conditional
query. Also represented in the class diagram is the relationship between the "User" class and the
"Document" class. The relationship is one-to-many (1..*): one "User" to many "Documents". This
means that a user can have many documents that belong to them, but no document can belong to more
than one user.
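The duplicate check via the "hash" field can be sketched as follows. Here a plain dict stands in for the indexed database column, and the function name is invented for this sketch:

```python
# Sketch of duplicate detection via an md5 hash of the normalized text.
import hashlib

hash_index = {}  # md5 hex digest -> document id (stands in for a DB index)

def store_document(doc_id: int, normalized_text: str) -> int:
    """Store a document unless an exact duplicate already exists.

    Returns the id of the stored (or previously stored) document."""
    digest = hashlib.md5(normalized_text.encode("utf-8")).hexdigest()
    if digest in hash_index:       # cheap lookup on the short, indexed hash
        return hash_index[digest]  # duplicate found: reuse the existing id
    hash_index[digest] = doc_id
    return doc_id

store_document(1, "original text 100")
print(store_document(2, "original text 100"))  # -> 1 (exact duplicate)
print(store_document(3, "another text"))       # -> 3 (new document)
```

The hash only catches byte-for-byte duplicates of the normalized text; near-duplicates still require the similarity algorithms described earlier.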
Figure 2.5: The Class Diagram
Figure 2.6: The Component's Diagram
3 Technologies used
In order to develop a product or service, it is necessary to find the right tools and technologies that
will ease development, cut costs, and speed up development time. There is a wide variety of
technologies that could be used for the given system, and in the end the choice comes down to one's own preferences,
experience, and goals.
3.1 Chosen technologies
There are many factors to take into consideration when deciding what tools and
technologies to use when developing a product or service. A stand-alone implementation is beneficial
when the end product needs to perform a single task, with few dependencies on other tools and services.
Text processing applications are a good example of stand-alone implementation, because such
applications start quickly on one's computer and are always there, installed and ready to start
processing commands. Media players are also a good example of stand-alone implementation, since
Internet connectivity might sometimes be missing, so online playback is not possible. Another software
implementation is the client-server one. Such an implementation is used when the software solution implies
a server that runs the main application and a client computer that issues the requests. The
web-based option is when the main application is installed on a remote server that contains all the
computational logic, and specially built client applications, or just Internet browsers, are used to make use
of the application installed on the remote server. This is a perfect solution these days, since everything
tends to move to the cloud, and the user only needs a powerful enough device to send
requests, process results, and visualize them on the screen. Good examples of web-based
applications are Facebook, SoundCloud, or the package tracking software used by DHL. Web-based
solutions are convenient because all data is stored on the company's servers, and the
algorithms and source code of the application are not exposed at all (not even in binary format, as in
the case of stand-alone applications). Web-based solutions are also more convenient from an economic
point of view, because it is easier to monetize such software and control licensing.

The next point to be considered is the construction of the system. There are four construction options
that can be applied to existing systems. The standard product represents solutions that are available
off the shelf. The product encapsulates a set of built-in features, much like a pre-fabricated building. If the features
available in the product are a good fit for the requirements of the business processes and policies, a
standard product may be a good choice. Customized construction applies when standard products have
facilities to configure and customize features, requiring varying amounts of expertise. The amount of
customization required to achieve the needed facilities is the prime consideration when choosing this option.
Custom-built solutions have a high degree of flexibility and can be built to suit the exact requirements
of the business; however, the costs and risks can be higher. New development methodologies like Agile
Development can minimize these costs and risks. The last type of construction is open source
solutions. Open source applications have emerged as a new option for constructing IT solutions. Built by
teams of programmers who collaborate by volunteering their efforts without compensation and for mutual
benefit, several open source applications have grown to become feature-rich products with complete
source code available to any user. Enterprises can use open source applications with low upfront
investments and, with the relevant technical skills, can customize the code to match their requirements. Open
source alternatives for basic requirements like email access, web site management, and basic collaboration
are now easily found, but applications for business process management are few and not yet mature
enough. However, this is a space to watch.

One factor that has to be taken into account when choosing the technology is the budget of the software
and its type of license. Another important factor is internationalization support. Considering these factors
and the purpose of the system, the technologies have been chosen accordingly.
3.1.1 MySQL 5
MySQL 5 has been chosen as the database management system. There are many database management systems (DBMS) on the market which provide, to some extent, the same capabilities. In the process of choosing the desired management system, a comparison was performed. As shown in table 3.1.1, the DBMSs chosen for examination were: Microsoft SQL Server, Oracle Database, PostgreSQL, MySQL and IBM's DB2. The following terms have to be explained. ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably; all the examined DBMSs support it. Referential integrity is a property which, when satisfied, means that every value of one column of a table needs to exist as a value of another column in a different or the same table. For referential integrity to hold in a relational database, any field in a table that is declared a foreign key can contain either a null value, or only values from a parent table's primary key or a candidate key. In other words, when a foreign key value is used it must reference a valid, existing primary key in the parent table. It is also supported by all the DBMSs, with one exception: MySQL supports it only partially. There are differences at the OS-support level: MSSQL naturally runs on Windows, while PostgreSQL is available on Windows, Mac, Linux, UNIX, BSD and Android. MSSQL and DB2 have a limited maximum database size, while PostgreSQL, MySQL and Oracle are unlimited. When considering the database capabilities, it can be noticed that MySQL has wide support for different operating systems and does not lack many of the other DBMSs' capabilities. The same applies to the indexes: MySQL supports them without notable limitations, while other systems apply constraints or lack some of the index types. Considering other types of objects, data types, foreign keys and associated operations (i.e. cascade delete), MySQL supports a wider range than other systems.
Table 3.1.1: Database management system comparison
(1 = Microsoft SQL Server, 2 = Oracle Database, 3 = PostgreSQL, 4 = MySQL, 5 = IBM DB2)

Referential integrity:
1: Yes; 2: Yes; 3: Yes; 4: Partial; 5: Yes

Max DB size:
1: 524,272 TB; 2: Unlimited; 3: Unlimited; 4: Unlimited; 5: 512 TiB

Indexes:
1: Hash (Non/Cluster & fill factor), Expression, Partial, Full-text, Spatial
2: R-/R+ tree, Hash (Cluster tables), Expression, Partial, Reverse, Bitmap, Full-text, Spatial
3: R-/R+ tree, Hash, Expression, Partial, Reverse, Bitmap, GiST, GIN, Full-text, Spatial (PostGIS)
4: R-/R+ tree (MyISAM tables only), Hash (MEMORY, Cluster, InnoDB tables only)
5: Expression, Reverse, Bitmap, Full-text

Capabilities (Merge joins, Blobs and Clobs, Common Table Expressions, Windows Functions, Parallel Query): supported to varying degrees by the compared systems.

Other objects:
1: Data Domain, Cursor, Trigger, Function, Procedure, External routine
2: Data Domain, Cursor, Trigger, Function, Procedure, External routine
3: Data Domain, Cursor, Trigger, Function, Procedure, External routine
4: Cursor, Trigger, Function, Procedure, External routine
5: Data Domain (via Check Constraint), Cursor, Trigger, Function, Procedure, External routine

Data types:
2: SQL'92 data types are mapped into Oracle types; no boolean type nor equivalent
3: Native data types, including boolean, money, date-time and numeric types; SQL'92 data type syntax is mapped directly into native Postgres types
4: Broad subset of SQL'92, including the SQL'92 numeric types
1, 5: subsets of the SQL'92 types plus product-specific types
time and easy to maintain. Laravel has everything needed to build a modern web application. It has support for active-record-style database interaction through its Eloquent base model class. In listing 3.1 we can see how easy it is to manipulate an object in Laravel. The Task object in this case is actually represented by objects stored in the database. Thanks to the Active Record model style implemented in Laravel, there is no need to write a single line of raw database queries, since the Eloquent model maps everything to the fields in the database and abstracts the methods of storing and retrieving the information.
// Fetch all tasks
$tasks = Task::all();
// Update a task
$task = Task::find(1);
$task->title = 'Put that cookie down!';
$task->save();
// Delete a task
Task::find(1)->delete();
});
Composer is built with PHP and, thanks to the huge PHP community, has given life to a lot of publicly accessible packages on websites like packagist.org and others. The good thing about this tool is that it encourages code reuse and the development of modular applications. Thanks to this, a module can be re-used in other applications, thus cutting down on development time, cost and maintenance. Composer's dependencies are defined in a composer.json file that contains a JSON structure. An example is provided in listing 3.3, which shows how a configuration file can be built to require two packages inside a project. Composer will also build a special autoloading file that can be used inside an application to lazy-load classes only when there is a need to instantiate an object of that class type.
{
"require": {
"illuminate/foundation": ">=1.0.0",
"mockery/mockery": "dev-master"
},
"autoload": {
"classmap": [
"app/controllers",
"app/models",
"app/database/migrations",
"app/tests/TestCase.php"
]
}
}
extensibility. Apache is free software, distributed by the Apache Software Foundation, which promotes various free and open-source advanced Web technologies. Apache is very easy to configure, and its functionality is easy to extend. Listing 3.4 shows how a new virtual host can be set up from an Apache .conf file.
# Ensure that Apache listens on port 80
Listen 80
# Listen for virtual host requests on all IP addresses
NameVirtualHost *:80
<VirtualHost *:80>
    DocumentRoot /www/example1
    ServerName www.example.com
</VirtualHost>

<VirtualHost *:80>
    DocumentRoot /www/example2
    ServerName www.example.org
</VirtualHost>
3.1.5 Git
Git is a free and open source distributed version control system designed to handle everything from small to very large projects with speed and efficiency [7]. The stand-out feature of Git is that it does not need a centralized server, as Subversion does. Git also supports branching, which is extremely helpful when working on several parts of an application at once while the work is not yet ready to make it into the app itself. Furthermore, Git is supported in a wide variety of IDEs and in tools such as Composer, which makes it much easier for developers to create modular packages. Below are some features of Git that stand at the foundation of the software:
• Frictionless Context Switching. Create a branch to try out an idea, commit a few times, switch
back to where you branched from, apply a patch, switch back to where you are experimenting, and
merge it in.
• Role-Based Codelines. Have a branch that always contains only what goes to production, another
that you merge work into for testing, and several smaller ones for day to day work.
• Feature Based Workflow. Create new branches for each new feature you're working on so you can
seamlessly switch back and forth between them, then delete each branch when that feature gets
merged into your main line.
• Disposable Experimentation. Create a branch to experiment in, realize it's not going to work, and just delete it, abandoning the work, with nobody else ever seeing it (even if you've pushed other branches in the meantime).
The repository for this project is stored locally and on BitBucket as a private repository. Pushing and committing the changes is done through PhpStorm's internal Git extension tools.
3.1.6 LibreOffice
LibreOffice is a comprehensive, professional-quality productivity suite that anyone can download
and install for free. There is a large base of satisfied LibreOffice users worldwide, and it is available in
more than 30 languages and for all major operating systems, including Microsoft Windows, Mac OS X
and GNU/Linux (Debian, Ubuntu, Fedora, Mandriva, Suse, ...). You can download, install and distribute
LibreOffice freely, with no fear of copyright infringement. What's outstanding about LibreOffice?
LibreOffice is a feature-packed and mature desktop productivity package with some really great
advantages:
• It's free – no worry about license costs or annual fees.
• No language barriers – it's available in a large number of languages, with more being added
continually.
• LGPL public license – you can use it, customize it, hack it and copy it with free user support and
developer support from our active worldwide community and our large and experienced developer
team.
• LibreOffice is a free software community-driven project: development is open to new talent and
new ideas, and our software is tested and used daily by a large and devoted user community; you,
too, can get involved and influence its future development.
In the proposed system, LibreOffice plays an important role, since the system makes use of the software's capability of converting documents to HTML format. This is achieved by using LibreOffice's command-line executable, which allows converting documents from many popular formats (such as doc, docx, rtf, odt) to HTML.
3.1.7 Pysimsearch library
This plagiarism detection system relies on a Python library called Pysimsearch [8] to calculate the
similarity between two full texts, using the Tf-idf algorithm. Tf-idf stands for term frequency-inverse document frequency, and is a quick algorithm for estimating similarity by comparing the weighted term frequencies of one document against those of the other.
3.1.8 SASS
SASS stands for Syntactically Awesome Style Sheets, and it is a tool that improves development with the CSS styling language. The advantages that SASS brings to CSS are numerous; some of the most important are:
• Variables: it is possible to define variables in SASS, which means that everything related to configuration can be kept in variables in a separate file. It is also possible to make calculations with the variables, such as dividing a width into equal parts.
• Mixins: they allow defining reusable blocks of styling, declared with @mixin, that can be included in an element's rules. This is very useful if, for example, we need to make an element have rounded corners: we just @include the rounded mixin.
• Functions: SASS comes with many useful functions, especially for working with colors. Functions like lighten, darken, invert and saturate help to manipulate colors easily and cleverly.
• Inheritance: probably the best feature that is missing from standard CSS. Inheritance makes writing CSS code much easier, and helps keep the code clean and structured.
SASS was used to style the user interface for the proposed application.
3.1.9 Text normalization
Normalization is a process that involves transforming characters and sequences of characters into a formally-defined underlying representation. This process is most important when text needs to be compared for sorting and searching, but it is also used when storing text, to ensure that the text is stored in a consistent representation.
Decomposition:
• Compatibility Decomposition:
It maps the character to one or more other characters that can be used as a replacement for the first character. This process is not necessarily reversible.
For example: [ ˚ : U+02DA RING ABOVE] = [ : U+0020 SPACE] + [ ̊ : U+030A COMBINING RING ABOVE]
• Canonical Decomposition:
It maps the character to its canonical equivalent character or characters. A canonical mapping is reversible without losing any information.
For example: [ Å : U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE] = [ A : U+0041 LATIN CAPITAL LETTER A] + [ ̊ : U+030A COMBINING RING ABOVE]
Composition:
• Canonical Composition:
Canonical composition (as opposed to canonical decomposition) replaces sequences of a base plus
combining characters with pre-composed characters.
For example: [ A :U+0041 LATIN CAPITAL LETTER A] + [ ̊ :U+030A COMBINING RING
ABOVE] = [ Å :U+00c5 LATIN CAPITAL LETTER A WITH RING ABOVE]
Unicode Normalization:
The Unicode Normalization forms define four forms of normalized text. The D, C, KD and KC normalization forms differ both in whether they are the result of an initial canonical or compatibility decomposition, and in whether the decomposed text is recomposed with canonically composed characters wherever possible.
• D: Canonical decomposition, not followed by canonical composition
• C: Canonical decomposition, followed by canonical composition
• KD: Compatibility decomposition, not followed by canonical composition
• KC: Compatibility decomposition, followed by canonical composition
Example:
• Input: [affairé]
• D: a + ff + a + i + r + e + ́
• C: a + ff + a + i + r + é
• KD: a + f + f + a + i + r + e + ́
• KC: a + f + f + a + i + r + é
From the above, we can use the normalization algorithm to:
• Strip diacritics:
• Normalization D
• Base characters (non-diacritics) always come first
• Diacritics are characters in the Combining Diacritical Marks block
• The results, base characters and diacritics, are not necessarily ASCII
• Split ligatures:
• Normalization KC
• A ligature is split into multiple characters
• Some split characters contain a space, which should be trimmed
• The results are not necessarily ASCII
• Strip diacritics & split ligatures:
• Normalization KD
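As an illustration, the stripping and splitting operations described above can be sketched with Python's standard unicodedata module (the function names here are illustrative, not part of the proposed system):

```python
import unicodedata

def strip_diacritics(text):
    # Normalization D: decompose each character into base + combining marks,
    # then drop the combining marks (Unicode category Mn)
    decomposed = unicodedata.normalize('NFD', text)
    return ''.join(ch for ch in decomposed if unicodedata.category(ch) != 'Mn')

def split_ligatures(text):
    # Normalization KC: compatibility decomposition splits ligatures such as
    # U+FB00 (ff) into component letters, then recomposes canonically
    return unicodedata.normalize('NFKC', text)

print(strip_diacritics('Åffairé'))            # Affaire
print(split_ligatures('a\ufb00air\u00e9'))    # affairé
```

Note that, as stated above, the results are not necessarily ASCII: NFKC keeps composed characters such as é intact while still splitting the ligature.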
3.1.10 Tf-idf algorithm
Tf-idf stands for term frequency-inverse document frequency, and is an excellent algorithm for checking whether two documents are similar. Since this is the most important algorithm, upon which the accuracy of the application depends, more details will be provided about how this algorithm works and what it is used for. In 1998 Google handled on average 9,800 search queries every day. By 2012 this number shot up to 5.13 billion average searches per day. The graph given below shows this astronomical growth. The major reason for Google's success is its PageRank algorithm, which determines how trustworthy and reputable a given website is. But there is also another part: the query entered by the user has to be matched against the relevant documents, and those documents have to be scored. The analogy to Google's algorithm is that an algorithm like Tf-idf lies at the foundation of its search relevancy results. For this purpose, the 3 documents below have been elaborated to better explain the Tf-idf algorithm:
Document 1: The game of life is a game of everlasting learning
Document 2: The unexamined life is not worth living
Document 3: Never stop learning
Listing 3.6 – Example documents
Let us imagine that you are doing a search on these documents with the following query: life learning.
The query is a free text query. It means a query in which the terms of the query are typed free-form into
the search interface, without any connecting search operators. Let us go over each step in detail to see
how it all works.
3.1.10.1 Term frequency (TF)
Term Frequency also known as TF measures the number of times a term (word) occurs in a
document. Given below are the terms and their frequencies for each of the documents.

Document 1: the: 1, game: 2, of: 2, life: 1, is: 1, a: 1, everlasting: 1, learning: 1
Document 2: the: 1, unexamined: 1, life: 1, is: 1, not: 1, worth: 1, living: 1
Document 3: never: 1, stop: 1, learning: 1

In reality each document will be of a different size. In a large document the frequency of the terms will be much higher than in the smaller ones. Hence we need to normalize the document based on its size. A simple trick is to divide the term frequency by the total number of terms. For example, in Document 1 the term game occurs two times. The total number of terms in the document is 10. Hence the normalized term frequency is 2 / 10 = 0.2. Given below are the normalized term frequencies for all the documents.
Table 3.1.7: Normalized TF for Document 3
Given below is the code in python which will do the normalized TF calculation.
def termFrequency(term, document):
    normalizeDocument = document.lower().split()
    return normalizeDocument.count(term.lower()) / float(len(normalizeDocument))
IDF(game) = 1 + loge(3 / 1)
          = 1 + 1.098612289
          = 2.098612289
Terms IDF
the 1.405465108
game 2.098612289
of 2.098612289
life 1.405465108
is 1.405465108
a 2.098612289
everlasting 2.098612289
learning 1.405465108
unexamined 2.098612289
not 2.098612289
worth 2.098612289
living 2.098612289
never 2.098612289
stop 2.098612289
    if numDocumentsWithThisTerm > 0:
        return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0
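The fragment above lacks its surrounding definitions; a complete, runnable version might look like the following sketch, where the function name and the document-counting step are assumed:

```python
from math import log

def inverseDocumentFrequency(term, allDocuments):
    # count how many documents contain the term at least once
    numDocumentsWithThisTerm = sum(
        1 for doc in allDocuments if term.lower() in doc.lower().split())
    if numDocumentsWithThisTerm > 0:
        return 1.0 + log(float(len(allDocuments)) / numDocumentsWithThisTerm)
    else:
        return 1.0

documents = ["The game of life is a game of everlasting learning",
             "The unexamined life is not worth living",
             "Never stop learning"]
print(inverseDocumentFrequency("game", documents))  # 1 + ln(3/1) = 2.0986...
```

The 1.0 added to the logarithm (and returned for unseen terms) prevents terms that occur in every document from being weighted to zero.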
3.1.10.4 Vector Space Model – Cosine Similarity
From each document we derive a vector. The set of documents in a collection then is viewed as a set
of vectors in a vector space. Each term will have its own axis. Using the formula given below we can find
out the similarity between any two documents.
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
Dot product(d1, d2) = d1[0] * d2[0] + d1[1] * d2[1] + … + d1[n] * d2[n]
||d1|| = square root(d1[0]^2 + d1[1]^2 + ... + d1[n]^2)
||d2|| = square root(d2[0]^2 + d2[1]^2 + ... + d2[n]^2)
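The formulas above translate directly into code; a minimal sketch (not the Pysimsearch implementation itself):

```python
from math import sqrt

def dot_product(d1, d2):
    # sum of the pairwise products of the vector components
    return sum(a * b for a, b in zip(d1, d2))

def norm(d):
    # Euclidean length: square root of the sum of squared components
    return sqrt(sum(a * a for a in d))

def cosine_similarity(d1, d2):
    return dot_product(d1, d2) / (norm(d1) * norm(d2))

print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # parallel vectors: similarity ~ 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # orthogonal vectors: similarity 0.0
```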
form a similarity metric. In this case, the cosine of the angle leads to a similarity value. If you’re
rusty on trigonometry, all you need to remember to understand this is that the cosine value is always
between –1 and 1: the cosine of a small angle is near 1, and the cosine of a large angle near 180
degrees is close to –1. This is good, because small angles should map to high similarity, near 1, and
large angles should map to near –1.”
The query entered by the user can also be represented as a vector. We will calculate the TF*IDF for each query term:

life: TF = 1/2 = 0.5, TF*IDF = 0.5 * IDF(life) ≈ 0.703
learning: TF = 1/2 = 0.5, TF*IDF = 0.5 * IDF(learning) ≈ 0.703
Let us now calculate the cosine similarity of the query and Document 1.
Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)
= 0.197545035151 / 0.197545035151
= 1
Given below are the similarity scores for all the documents and the query.
Plotted below are the vector values for the query and the documents in the 2-dimensional space of life and learning. Document 1 has the highest score of 1. This is not surprising, as it contains both of the terms life and learning.
Figure 3.2: Cosine similarity results for the query "life learning"
4 System implementation
4.1 Feature implementation
This chapter will describe the application feature implementation, how these features actually work,
what difficulties were encountered and what workarounds were found. Back in 2.1.4, the system's feature
requirements were:
• Upload document
• Convert the document format to a single format
• Normalize the input text
• Store document in local database
• Check for similarities against other locally stored documents
• Get a score based on the similarity test
• See a text difference with similar documents
Below, the critical sections of the application implementation will be described in more detail.
4.1.1 Document upload
The document upload functionality is pretty straightforward. Once on the main page of the application, the user is presented with a way to upload a document. The user can select a file and upload it, as presented in Figure 8.
Figure 4.2: File upload error: format not allowed
In PHP, there is a function called passthru that allows running command-line commands. Therefore, to convert the uploaded document to HTML, the following snippet of code is run:
passthru('soffice --headless --convert-to html:"HTML" ' . $source . ' --outdir ' .
$output_folder, $retval);
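For reference, the equivalent soffice invocation could also be scripted outside PHP; the sketch below builds the same command in Python (soffice is assumed to be on the PATH, and the helper names are illustrative):

```python
import subprocess

def soffice_command(source, output_folder):
    # same flags as in the passthru() call above
    return ["soffice", "--headless", "--convert-to", "html:HTML",
            source, "--outdir", output_folder]

def convert_to_html(source, output_folder):
    # runs LibreOffice headlessly; raises CalledProcessError on failure
    subprocess.run(soffice_command(source, output_folder), check=True)

print(" ".join(soffice_command("thesis.docx", "out/")))
```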
public function get_normalized() {
    // remove unneeded tags
    $tidy_config = array(
        'clean' => true,
        'output-xhtml' => true,
        'show-body-only' => true,
        'wrap' => 0,
    );
    // run Tidy over the document's HTML (the source field name is assumed here)
    $tidy = tidy_parse_string($this->html, $tidy_config, 'utf8');
    $tidy->cleanRepair();
    $text = strval($tidy);
    $text = preg_replace('~<table[^>]*>(.*?)</table>~is', '', $text);
    $text = preg_replace('~<title[^>]*>(.*?)</title>~is', '', $text);
    $text = strip_tags($text);
    $text = $this->normalizeEntities($text);
    $text = html_entity_decode($text, ENT_COMPAT | ENT_HTML401);
    $text = \Normalizer::normalize($text, \Normalizer::FORM_C);
    $text = preg_replace('~\s~is', ' ', $text);
    $text = trim(preg_replace('~\s\s+~is', ' ', $text));
    $text = strtolower($text);
    $text = $this->remove_stopwords($text);
    return $text;
}
$doc->hash = $hash;
$doc->save();
more it means that the two documents are similar. From figure 4.3 we can notice that Document #10 has a 100% similarity with Document #9; in fact the real similarity score is 99.99%, but the number was rounded to 100%. In reality, there can't be two documents with an exact 100% score, since that would be a duplicate, and duplicates are not stored, because we perform the hash check. From this page, we can click "Try another search" to return to the home page and perform a different check.
The actual similarity check is done using the Pysimsearch Python library, which is based upon the Tf-idf algorithm. Initially this library was designed to compute the similarity between web pages, like this:
$ python similarity.py http://www.stanford.edu/ http://www.berkeley.edu/ http://www.mit.edu/
Comparing files ['http://www.stanford.edu/', 'http://www.berkeley.edu/',
'http://www.mit.edu/']
sim('http://www.stanford.edu/','http://www.berkeley.edu/')=0.322771960247
sim('http://www.stanford.edu/','http://www.mit.edu/')=0.142787018368
sim('http://www.berkeley.edu/','http://www.mit.edu/')=0.248877629741
The Pysimsearch library's functionality has been adapted to serve our system's needs.
4.1.6 Text difference visualization
to render the text difference formatted output:
$doc1_text = convert_html_to_text(
    file_get_contents(storage_path('docs/' . $doc1_id . '/' . $doc1_id . '.html')));
$doc2_text = convert_html_to_text(
    file_get_contents(storage_path('docs/' . $doc2_id . '/' . $doc2_id . '.html')));

$granularity = new cogpowered\FineDiff\Granularity\Word;
$diff = new cogpowered\FineDiff\Diff($granularity);

return nl2br(preg_replace('~\s{3,}~Uis', '', $diff->render($doc1_text, $doc2_text)));
5 Economic Analysis
5.1 Project description
Living in an information age, we tend to protect our intellectual property with different laws and moral ethics. Information theft is a rising problem in modern society, since it comes with different complications: it violates copyright and discourages originality and uniqueness, which in turn can have an adverse effect on scientific development. There is an obvious need for ways to prevent the spread of this malignant phenomenon, which is called plagiarism. The proposed plagiarism detection system comes in very handy for academic institutions and teachers, helping them find out whether a student's submitted paper is original and can be graded on that criterion, or otherwise take the required measures to prevent this from happening again. The system is very simple to use: it only requires uploading the document in its original form, and the system will give instant results about its similarity with other, already submitted documents, highlighting a possible plagiarism case.
5.2 Schedule
Since the proposed system requires constant improvement and feature optimization, the schedule will be made based on the first iteration of the application. The schedule is made up of 3 parts, which describe the planning: determining the objectives; estimating the work time needed and dividing the tasks; and the time required to implement each of the tasks.
5.2.1 Objectives
Setting the objectives is an important step, since this will keep the team focused on following a plan
and creating a functional initial product release. Objectives are also important to make sure that the team
can keep track of deadlines, and allow enough time for each objective to create every application
component securely and with quality.
5.2.2 Work amount estimate
The time needed to create an initial release product can be divided into 3 chunks. The first one will be used for planning the application, setting objectives, and building UML graphs of internal components and interactions. We should allow plenty of time for this step, because we want to stay focused on following our objectives and avoid veering from the initial goals. The second amount of time will be dedicated to development. During this phase, developers will implement the functionality of the system into runnable software, which testers can further use to make sure that the quality standards are met. The third amount of time will be assigned to deployment tasks and server setup.
5.2.3 Schedule
To correctly estimate the amount of time required to develop the product, it is mandatory to identify the bigger parts of the tasks necessary for implementation, and to split these tasks into smaller activities that can be evaluated and managed more easily. Thanks to this, it will be easier to define the logical steps for project implementation. The duration of development can be represented by the following formula:
Duration = Finish date – Start date + reserve time
Where the time between the start date and the finish date is the time when the activities are carried out, and the reserve time is a buffer added to make up for any unplanned circumstances beyond the planned time.
The following table describes the schedule for the development of the proposed system, with the
following notations: PM – project manager, SD – software developer, SA – system architect, PO –
product owner, PC – personal computer, EA – enterprise architect, IDE – integrated development
environment, T – tester, D – designer.
Hence, according to the schedule presented earlier, the expected time of delivery is 123 days, plus 21 days
reserve time. Given this table, we can count how much time is required by every worker to do their work.
Note: the days described here are actually working days, 1 day = 8 hours.
• Project manager – 118 days
• System architect – 38 days
• Software developer – 84 days
• Designer – 30 days
• Tester – 50 days
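As a quick check of the delivery estimate above (123 working days of development plus 21 days of reserve, at 8 hours per working day):

```python
development_days = 123   # expected delivery time, working days
reserve_days = 21        # buffer for unplanned circumstances
total_days = development_days + reserve_days
total_hours = total_days * 8   # 1 working day = 8 hours
print(total_days, total_hours)  # 144 1152
```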
5.3 Economic proof
Bringing an economic justification for IT projects is a timely task, given the specifics of competition in economic relationships, which opens a wide research space. Under the conditions of a poorly determined marketing environment, high price volatility and a limited forecasting depth, a common business plan does not allow the final results of the business to be foreseen exactly. In this context, one of the basic instruments is choosing the right methods, positions and parameters for the economic justification. Achieving this goal requires a large amount of scientific research, subordinated to the primary goal and formulated by means of the following objectives:
a) studying the theoretical and methodological aspects of business planning under the conditions of market competition;
b) systematizing and determining the methodology and specifying the indices for the economic justification of business plans in IT;
c) studying and analyzing the actual practice of economic justification of business plans for IT in the Republic of Moldova;
d) developing methodological concepts for justifying investment decisions under conditions of risk and uncertainty;
e) studying the evaluation criteria of business projects' efficiency and elaborating a mechanism for their complex evaluation.
Table 5.3.1: Long term materials expenses
1 PC 10500 5 52500
Total 64.000
The non-material expenses, shown in table 5.3.2, include the tools/software used in the development process.
3 Gimp MDL 0 1 0
Total 2.587
Table 5.3.3 includes the direct expenses: logistics products that were used during the project development.
3 Pen 10 15 150
4 Marker 15 6 90
Total 1.035
The total material and non-material expenses for the development of this project amount to 67.622 MDL.
5.3.2 Salary expenses
The workers' salaries are calculated according to table 5.3.4, where we can see the amount of work (time) done and the price for each kind of job.
FS = Frm * Cfs (5.2)
Where FS is the sum of the contributions to the Fondul Social (social fund) and Cfs is the contribution quota for the state mandatory social insurance, approved each year by the state Budget Law (in 2014 – 23%).
FS = 227880 * 0.23 = 52412,4 (5.3)
MA = Frm * Cam (5.4)
Where MA is the Medical Assurance and Cam is the medical assurance quota, approved each year by the Budget Law for state medical assurance (in 2014 – 4%).
MA = 227880 * 0.04 = 9115,2 (lei) (5.5)
The sum Frm + FS + MA will be the total expense for work retribution
Total = 227880 + 52412,4 + 9115,2 = 289407,6 (lei) (5.6)
The retirement fund consists of 6% of the gross income and will be calculated by the following formula:
Retirement fund = Gross income * 0.06 = 227880 * 0.06 = 13.672,8 (lei) (5.7)
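The salary-fund computations above (formulas 5.2 through 5.7) can be verified with a few lines of Python, using the 2014 quotas stated above:

```python
gross = 227880            # Frm: gross salary fund, MDL
fs = gross * 0.23         # Fondul Social contribution (Cfs, 23% in 2014)
ma = gross * 0.04         # medical assurance (Cam, 4% in 2014)
total = gross + fs + ma   # total expense for work retribution
retirement = gross * 0.06 # retirement fund, 6% of gross income
print(fs, ma, total, retirement)
```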
5.3.3 Indirect expenses
The fixed means pay-off is the partial loss of the consumable properties and value of the means
during their usage, influenced by different factors and the increase of the work productivity. For
computing electricity usage we’ll take into account that PC uses 400W per working hour (8 hours per day
* 123 usage days, 984 hours). Therefore we have:
Total power usage = (5 x 400 x 984)/1000 = 1.968kW (5.8)
Total 22.708,44
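The energy figure from formula 5.8, expressed as code:

```python
pcs = 5          # number of PCs
watts = 400      # consumption per PC per working hour, Wh
hours = 8 * 123  # 8 h/day over 123 working days = 984 hours
energy_kwh = pcs * watts * hours / 1000
print(energy_kwh)  # 1968.0
```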
5.3.4 Wear calculation
Table 5.3.6 shows the wear of the equipment used for this project: the partial loss of the consumable properties and value of the means during their usage, influenced by different factors and by the increase of work productivity. The wear expense is computed as:
FA = (V * T) / T1 (5.9)
Where V is the initial value of the active, T1 is the useful usage time of the active, and T is the actual time the active will be used in project development.
Long term material active   Initial value, MDL   Useful usage time, months   Wear expenses, MDL
PC                          52500                36                          8750
Total                                                                        10.666,66
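Formula 5.9, applied to the PC row of table 5.3.6 (T = 6 months is assumed here, which matches the tabulated wear of 8750 MDL and roughly corresponds to the 123 working days of the project):

```python
def wear_expense(initial_value, useful_months, used_months):
    # FA = (V * T) / T1: the share of the asset's value consumed during use
    return initial_value * used_months / useful_months

print(wear_expense(52500, 36, 6))  # 8750.0
```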
5.5 Financial results
The plagiarism detection system project will have a net turnover of 450.000 MDL (CAb).
The net income is computed by removing the 20% VAT (TVA) from the gross turnover:
CAn = CAb – TVA = 450000 – 20% = 360000 MDL (5.10)
The gross profit can be calculated with formula 5.11, where PC is the total project cost:
Pb = CAn – PC = 360000 – 322.782,7 = 37.217,3 MDL (5.11)
Now that the gross profit is determined, we can compute the net profit. To get the net profit we have to subtract from the gross profit a certain percentage that depends on the type of entity (a natural person or a legal entity). In our case the company "NoPlagiat" SRL is a legal entity, and the percentage to subtract is 12%. Thus the net profit formula looks like:
Pn = 37.217,3 – 37.217,3 *0.12 = 32.751,23 MDL (5.12)
The profitability indicators can be computed to assess the product's success on the market:
Sales profitability = Pn / CAb * 100% = (32.751,23 / 450000) * 100 = 7.28 % (5.13)
Economic profitability = Pn / PC * 100% = (32.751,23 / 322.782,7) * 100 = 10.14 % (5.14)
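The chain of formulas (5.10)–(5.14) can be verified with a short sketch using only figures stated in the text.

```python
# Financial indicators, formulas (5.10)-(5.14); amounts in MDL.
CA_B = 450000            # net turnover CAb
VAT_RATE = 0.20          # TVA
PROJECT_COST = 322782.7  # total project cost PC
INCOME_TAX = 0.12        # profit tax rate for a legal entity

ca_n = CA_B * (1 - VAT_RATE)   # 360000       (5.10)
pb = ca_n - PROJECT_COST       # ~ 37217,3    (5.11)
pn = pb * (1 - INCOME_TAX)     # ~ 32751,2    (5.12)

sales_profitability = pn / CA_B * 100             # ~ 7.28 %  (5.13)
economic_profitability = pn / PROJECT_COST * 100  # ~ 10.1 %  (5.14)
```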
5.6 Conclusion
The plagiarism detection system does not have a high profitability percentage, probably because it is intended for academic institutions that cannot afford to pay much for expensive software. The project might receive extra funding from the government to support the good cause it brings to society. Once the system has a stable platform, its functionality can be sold to institutions in other countries in the form of Software as a Service, meaning they will pay for using our plagiarism detection service (we do not sell the software itself, nor any source code).
Conclusions
After developing and testing the plagiarism detection system, it was found that the proposed objectives and requirements have been met. The system will help people, and for the most part academic institutions, to combat the plagiarism phenomenon and encourage fair play. The system allows users to upload a document and check it for similarity with other stored documents. As more documents are added to the database, the chances of detecting plagiarism increase and the system becomes more effective. The user interface is simple and easy to use, and does not require extra investment in software training. The system's functionality is offered as Software as a Service: it does not require any software to be installed on the user's computer (except an Internet browser) and can be used remotely.
The system was built using modern technologies, most of which are open-source and do not require a commercial license. A few can be enumerated: PHP 5.3 was used as the main programming language of the application; since PHP is a popular language, the software can be developed and maintained easily. LibreOffice has been used to convert documents from different file formats to a single format, HTML, the conversion being performed through the command-line executable. The Laravel framework has been used as the skeleton of the application and for keeping good coding practices, since Laravel enforces well-established coding conventions. The Pysimsearch application has been used to compute the similarity between two full texts; the similarity measure is based on term frequency–inverse document frequency (TF-IDF), which, of all tested algorithms and solutions, proved the fastest and most efficient. The database uses MySQL, and Apache 2 has been used as the web server. By interconnecting all these technologies and libraries it was possible to build a functional plagiarism detection system that works and yields satisfactory results.
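The TF-IDF-based similarity mentioned above can be illustrated with a minimal sketch. This is a generic cosine-over-TF-IDF computation written for illustration, not Pysimsearch's actual implementation; the corpus and tokenization are toy examples.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for a small corpus of tokenized documents."""
    n = len(docs)
    df = Counter()                      # document frequency of each term
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)               # raw term frequency
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

docs = ["the cat sat on the mat".split(),
        "the cat sat on the rug".split(),
        "plagiarism detection with tfidf".split()]
vecs = tfidf_vectors(docs)
# The first two documents share most terms, so their similarity is the highest;
# the third shares none, so its similarity to the others is zero.
```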
The created system is easy to maintain and can be extended to offer more functionality and improved efficiency. To obtain better results, the similarity-checking algorithm needs to be improved. Future plans for this system include: creating a per-institution user management system, since an academic institution might need several accounts for its accreditation committee members; giving the user interface a cosmetic revamp to simplify interaction with the system; implementing a queue of jobs and moving the similarity check to a background operation, because as the database of documents grows, the time required to perform a check will grow linearly and could otherwise overload the server; and introducing a yearly fee for the academic institutions that will use the plagiarism detection system the most, to keep development active. Support for the development of this system could also be obtained from state grants, since this software can have a beneficial effect on society.
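The background job queue proposed above could be sketched with a single worker thread, as below. All names are illustrative and not part of the actual system (which is written in PHP); the point is only that checks are enqueued and processed off the request path.

```python
import queue
import threading

def similarity_check(doc_id):
    """Placeholder for the expensive similarity computation."""
    return "report-" + doc_id

jobs = queue.Queue()
results = {}

def worker():
    # Process queued documents in the background, one at a time,
    # so web requests return immediately instead of blocking.
    while True:
        doc_id = jobs.get()
        if doc_id is None:          # sentinel: shut the worker down
            break
        results[doc_id] = similarity_check(doc_id)
        jobs.task_done()

t = threading.Thread(target=worker, daemon=True)
t.start()
for doc in ("thesis-1", "thesis-2"):
    jobs.put(doc)
jobs.put(None)                      # tell the worker to stop
t.join()                            # all queued checks are done here
```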