
Literature Review on Data Normalization and Clustering

Data Normalization:

Introduction:

Data normalization is a standardized way of keeping a data structure clean and efficient by eliminating data duplication and errors in data operations. “It is a process in which data attributes within a data model are organized to increase the cohesion of entity types”. The aim of applying the normalization process to a data set is to eliminate data redundancy, since without it a relational database ends up storing objects that share similar attributes across several tables.

Data normalization plays a vital role in successful database design. Without normalization, database operations can generate errors, and the database system can become inefficient and inaccurate.

Normalization techniques:

Normalization is a process of efficiently organizing the data in a database. It ensures that there is no redundant data and that the dependencies between data items are meaningful. This helps reduce storage space and increase performance.

Normalization techniques are a set of rules, and each rule is called a “normal form” (NF). The forms range from first normal form (1NF) to fifth normal form (5NF) in a series of increasing normalization levels. There is also one higher level, called domain/key normal form (DK/NF). For now, the three basic forms are described:

i. First normal form (1NF)
ii. Second normal form (2NF)
iii. Third normal form (3NF)

First normal form (1NF): An entity type is in first normal form if it does not contain any repeating columns in a table. First normal form can be achieved by:

i. Eliminating repeating groups from the same table.
ii. Placing similar data in a separate table and identifying each row with a unique identifier, or primary key (PK), as sketched below.
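
As a minimal illustration (the table and column names here are hypothetical, not taken from the research dataset), the following Python sketch flattens a repeating group into 1NF rows with a surrogate primary key:

```python
import pandas as pd

# Hypothetical table that violates 1NF: the offence columns form a
# repeating group within each row.
raw = pd.DataFrame({
    "region": ["North", "South"],
    "offence_1": ["Burglary", "Robbery"],
    "offence_2": ["Theft", "Fraud"],
})

# 1NF: melt the repeating columns into one offence per row, then add a
# unique identifier (primary key) for each row.
flat = raw.melt(id_vars="region", value_name="offence").drop(columns="variable")
flat.insert(0, "crime_id", range(1, len(flat) + 1))  # surrogate primary key
print(flat)
```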

Second normal form (2NF): An entity type is in second normal form when it satisfies the rule of 1NF and all of its non-key attributes depend on the whole primary key of the table. Second normal form can be achieved by:

i. Breaking up the table and placing the related entities in a separate table with a unique identifier.
Third normal form (3NF): An entity type is in 3NF when it satisfies the rule of 2NF and all of its attributes are directly dependent on the primary key.

i. Third normal form can be achieved by further splitting the tables of the second normal form (see the sketch below).
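
To make the splitting for 2NF and 3NF concrete, here is a hedged Python sketch (again with hypothetical column names): an attribute that depends on another non-key column rather than directly on the primary key is factored out into its own table.

```python
import pandas as pd

# Hypothetical 1NF table: "sub_group_name" depends on "sub_group_id"
# rather than directly on the primary key "crime_id".
crimes = pd.DataFrame({
    "crime_id": [1, 2, 3],
    "sub_group_id": [10, 10, 20],
    "sub_group_name": ["Burglary", "Burglary", "Robbery"],
    "offence_count": [120, 95, 40],
})

# Split: move the dependent attribute into its own table, keyed by
# sub_group_id, keeping only the foreign key in the crimes table.
sub_groups = crimes[["sub_group_id", "sub_group_name"]].drop_duplicates()
crimes_norm = crimes.drop(columns="sub_group_name")

# The original rows can be recovered with a join, so the decomposition
# loses no information.
restored = crimes_norm.merge(sub_groups, on="sub_group_id")
```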

Disadvantages of non-normalized data:

Several issues arise during development when processing non-normalized data:

i. Repetition of information in the database.
ii. Possibility of loss of information.
iii. Difficulty in maintaining information.
iv. Inconsistency in data operations.

Therefore, the data set needs to be normalized before processing, establishing functional dependencies and reducing non-key data redundancy.

Advantages / Goals:

An efficient and functional database is a key to successful development. Normalization achieves this by storing each piece of data where it logically and uniquely belongs. There are mainly four objectives of normalization:

i. Arranging data into logical groups such that each group describes a small entity
of the whole.
ii. Minimizing the amount of duplicated data stored in a database.
iii. Building a database in which we can access and manipulate the data quickly and
efficiently without compromising the integrity of the data storage.
iv. Organising the data such that, when you modify it, you make the changes in
only one place.

Normalization process in current research:

For this research, the original dataset was taken from the Home Office site http://rds.homeoffice.gov.uk/rds/soti.html and contains crime details for the UK from 2003 to 2010, with a total of 79,272 records. The data was in comma-separated value (.CSV) format.

As per the requirements of my research, I used MSSQL for storing and manipulating the dataset, so the .CSV file first had to be imported into an MSSQL server. The second phase was to obtain a standardized dataset through data normalization. The steps are as follows:

i. Transformation from the CSV file to an MSSQL database.
ii. Naming or labelling the data columns with meaningful entity names.
iii. Introducing a unique identifier (primary key) to the original data table, which yields the first normal form.
iv. Removing duplicate data by introducing a new table with two columns, Crime ID and Crime Sub Group, which yields the second normal form.
v. Removing character values from the integer column (a cleaning sketch follows this list).
vi. Removing decimal values from the same column.
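
A hedged sketch of steps v and vi, assuming the counts sit in a hypothetical column named "offences" (the file path and column name are assumptions, not taken from the research):

```python
import pandas as pd

df = pd.read_csv("crime_data.csv")  # assumed export of the raw table

# Step v: coerce the count column to numbers; character values such as
# "x" become NaN and their rows are dropped.
counts = pd.to_numeric(df["offences"], errors="coerce")
df, counts = df[counts.notna()], counts.dropna()

# Step vi: remove the remaining rows whose counts are decimal values.
whole = counts % 1 == 0
df = df[whole].assign(offences=counts[whole].astype(int))
```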

The normalized data can be summarized as follows:

Specification                  Corresponding value
Total records                  79,272
Character values (x)           204
Decimal values (.)             6
Total normalized records       79,062
Total number of offences       89,273,147
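
These figures are consistent: 79,272 - 204 - 6 = 79,062, so removing the 204 rows with character values and the 6 rows with decimal values leaves 79,062 normalized records.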

Hence the final normalized dataset was obtained for the research.

Data Clustering:

Introduction:

The data clustering algorithms implemented for this research are:

1. K-Means Algorithm.

K-means is a simple unsupervised learning algorithm that solves the clustering problem. The algorithm aims to minimize an objective function, in this case the squared-error function

    J = \sum_{j=1}^{k} \sum_{i=1}^{n} \lVert x_i^{(j)} - c_j \rVert^2

where \lVert x_i^{(j)} - c_j \rVert^2 is the distance measured between a data point x_i^{(j)} and the cluster centre c_j, and the double sum is an indicator of the distance of the n data points from their respective cluster centres.

The procedure for K-means is as follows:

i. First, place k points into the space represented by the objects being clustered. These points represent the initial group centroids.
ii. Using the distance measure, assign each object to the group with the closest centroid.
iii. After all objects have been assigned, recalculate the centroid positions.
iv. Repeat steps ii and iii until the centroids no longer move from their previous positions. This forms separated groups of corresponding objects (a minimal implementation sketch follows this list).
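
The following NumPy sketch is a minimal illustration of this procedure, not the research implementation; the random initialization and the convergence test are assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Cluster the rows of X (an n-by-d array) into k groups."""
    rng = np.random.default_rng(seed)
    # Step i: place k initial centroids, here chosen among the data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step ii: assign each object to the group with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step iii: recalculate each centroid as the mean of its members
        # (keeping the old centroid if a cluster ends up empty).
        centroids_new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step iv: stop once the centroids no longer move.
        if np.allclose(centroids_new, centroids):
            break
        centroids = centroids_new
    return labels, centroids

# Example usage on three well-separated 2-D blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(c, 0.3, size=(50, 2)) for c in (0, 5, 10)])
labels, centres = kmeans(X, k=3)
```

Each assignment and update step can only decrease the squared-error objective J above, which is why the iteration settles on stable centroids.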
