
PREDICTION OF ENVIRONMENTAL

POLLUTION USING METAHEURISTICS


A Project Report
submitted in partial fulfillment for the requirements of the award of the degree
of

BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY

Submitted by

CHARUL BISHNOI (1502913029)

SHIVANGI AGARWAL (1502913092)

SHRISHTI JAIN (1502913099)

SAMARTH GOEL (1502913079)

Supervised by

Prof. MOHIT AGARWAL

DEPARTMENT OF INFORMATION TECHNOLOGY


KIET GROUP OF INSTITUTIONS, GHAZIABAD, UTTAR PRADESH
(Affiliated to Dr. A. P. J. Abdul Kalam Technical University, Lucknow, U.P., India)
Session 2018-19
DECLARATION

We declare that
a. the work contained in this report is original and has been done by us under the
guidance of our supervisor.
b. the work has not been submitted to any other institute for any degree or
diploma.
c. we have followed the guidelines provided by the institute to prepare the report.
d. we have conformed to the norms and guidelines given in the ethical code of
conduct of the institute.
e. wherever we have used materials (data, theoretical analysis, figures and text)
from other sources, we have given due credit to them by citing them in the text
of the report and giving their details in the references.

Signature of the student


Name: Charul Bishnoi
Roll number: 1502913029

Signature of the student


Name: Shivangi Agarwal
Roll number: 1502913092

Signature of the student


Name: Shrishti Jain
Roll number: 1502913099

Signature of the student


Name: Samarth Goel
Roll number: 1502913079

Place: KIET Group of Institutions, Ghaziabad


Date:

CERTIFICATE

This is to certify that the project Report entitled, “Prediction Of


Environmental Pollution Using Metaheuristics” submitted by Charul
Bishnoi, Shivangi Agarwal, Shrishti Jain and Samarth Goel in the
Department of Information Technology of KIET Group of Institutions,
Ghaziabad, affiliated to Dr. A. P. J. Abdul Kalam Technical University,
Lucknow, Uttar Pradesh, India, is a record of bonafide project work carried out
by them under my supervision and guidance and is worthy of consideration for
the award of the degree of Bachelor of Technology in Information Technology
of the Institute.

Signature of Supervisor:

Supervisor Name: Prof. Mohit Agarwal

Date:

LIST OF FIGURES

1.1 Smog in Delhi 3


1.2 Harvest burning in Haryana 3
2.1 Point Anomaly 12
2.2 Context Anomaly 12
2.3 Collective Anomaly 13
2.4 Flow chart of k-means algorithm 14
2.5 Flow chart of PSO Algorithm 17
3.1 Architecture of the Proposed Model 18
3.2 The value of two dimensions (x, y) after using PCA 22
4.1 RMSE and MSE value of Euclidean Distance 24
4.2 RMSE and MSE value of Spearman Distance 24
4.3 RMSE and MSE value of Manhattan Distance 24
4.4 RMSE and MSE value of Pearson Distance 24
4.5 RMSE and MSE value of Chebyshev Distance 24
4.6 Correlation between gases 25
4.7 Cluster formation using k-means 25
4.8 Cluster formation using K-means2 26
4.9 Silhouette, SSE, quantization error values of k-means 26
4.10 Silhouette, SSE, quantization error values of k-means2 26
4.11 Cluster formation using PSO 1 27
4.12 Cluster formation using PSO 2 27
4.13 Gbest score of PSO 1 27
4.14 Gbest score of PSO 2 28
4.15 Graph showing the concentration of gases in air 28

LIST OF TABLES

1.1 Comparative study of the related research work 4


3.1 Dataset 19

LIST OF ACRONYMS

PSO Particle Swarm Optimization


AQHI Air Quality Health Index
MSE Mean Square Error
RMSE Root Mean Square Error
CRB Crop Residue Burning
API Air Pollution Index
WAPMS Wireless Sensor Network Air Pollution Monitoring System
RCQ Recursive Converging Quartiles
ANN Artificial Neural Network
MTL Multi Task Learning
PCA Principal Component Analysis
gbest Global Best
SSE Sum of Squared Errors

ABSTRACT

Air pollution is increasing day by day, weakening the world economy, degrading the
quality of life and resulting in a major loss of productivity. At present it is one of the
most critical problems, with a significant impact on human health and the ecosystem.
Reliable air quality prediction can reduce the effect of a pollution peak on the nearby
population and ecosystem; hence improving air quality prediction is a prime objective
for society. Clustering is a data analysis procedure that is used to inspect and interpret
vast collections of data. Outlier detection is used to find the anomalies in the data that
do not match the presumed behaviour. It separates the data into two types of points,
normal points and outliers (points that behave differently from the normal points). By
finding these outlying points we can identify the major cause of a problem. The
effectiveness of outlier detection methods depends to a large extent on the ability of
the underlying clustering method. The ability of clustering methods in turn depends on
the effective and efficient choice of parameters such as the initial centroids and the
number of clusters. Although nature inspired metaheuristic algorithms have shown a
lot of promise as optimizers for various processes, this has not been adequately
explored in the existing approaches to improve the effectiveness of clustering
algorithms. Clustering methods suffer from inaccuracy when the dataset consists of
groups with various composite shapes and outliers. Furthermore, nature inspired
algorithms have also been developed to find exact solutions to clustering problems.
These algorithms provide better quality results in comparison to traditional
algorithms. For this purpose, the Particle Swarm Optimization (PSO) technique can be
used to optimize the clusters formed, thus fulfilling the goal of this research in an
effective way.

CONTENTS

Page No.
DECLARATION i
CERTIFICATE ii
LIST OF FIGURES iii
LIST OF TABLES iv
LIST OF ACRONYMS v
ABSTRACT vi
CHAPTER 1: INTRODUCTION 1
1.1 Background 1
1.2 Motivation 2
1.2.1 Case Studies 3
1.3 Literature Review 4
1.4 Research Gaps 9
1.5 Research Objectives 11
CHAPTER 2: CLUSTERING AND METAHEURISTICS 12
2.1 Outliers and Outlier Detection 12
2.1.1 Types of Anomalies 12
2.2 Clustering 13
2.2.1 K-Means Flowchart 13
2.3 Principal Component Analysis 15
2.3.1 Working of PCA 15
2.4 Metaheuristics 15
2.4.1 Particle Swarm Optimization 16
CHAPTER 3: EXPERIMENT ANALYSIS 18
3.1 Model Architecture 18
3.2 Solution Algorithm 19
CHAPTER 4: RESULT AND DISCUSSION 24

CHAPTER 5: CONCLUSION 29
REFERENCES 30
APPENDIX A: PLANNING OF WORK 34
APPENDIX B: CODE 35

Chapter 1
Introduction
1.1 Background
With the evolution of the economy and society everywhere on the planet, the world is
experiencing increased concentrations of air pollutants. Currently, this environmental
problem is one of the most severe problems, with a major influence on human health
and the ecosystem. Various efforts have been made by governments towards the
management of pollution, and much success has been obtained in the same. Air
pollutants are emitted mostly by industrial factories and vehicles due to the increasing
usage of hydrocarbons and other petrochemicals and fossil fuels [21]. Human health
problems are one of the important consequences of air pollution, particularly in urban
areas. Global warming from anthropogenic greenhouse gas emissions is a long-term
consequence of air pollution. Correct air quality prediction can reduce the effect of a
pollution peak on the surrounding population and ecosystem, hence improving air
quality prediction is a very important goal for society [19].

An air pollutant is matter in the atmosphere that can have severe impacts on society
and therefore on the entire surroundings. The matter can be solid particles, liquid
droplets or gases. There are many activities and factors that are responsible for the
emission of pollutants into the surroundings. The sources or origins of pollutants can
be categorised into two main classes: anthropogenic (man-made) sources and natural
sources. Anthropogenic sources are largely related to the burning of various sorts of
fuel, such as smoke stacks of fossil fuel power stations, motor vehicles, marine
vessels, aircraft, nuclear weapons and toxic gases. Natural sources include dust,
smoke, volcanic activities and ash particulates [18].

Environmental pollution costs the global economy around $5 trillion every year as an
outcome of productivity loss and a deteriorating standard of living. These productivity
losses are caused by deaths due to diseases caused by environmental pollution. One
out of ten deaths is caused by ailments related to atmospheric pollution, and the
situation is becoming unsatisfactory. The problem is growing in developing societies
[17].

The Air Quality Health Index (AQHI) is a public information tool developed to help
people understand the impact of air quality on health. The AQHI is an indicator on a
ranking scale from one to ten-plus, supported by mortality studies, that gives the level
of health effect associated with local air quality. The higher the number, the larger the
effect on health and the greater the necessity to take certain actions [1].

Most recent air quality prediction uses simple methods such as box models, Gaussian
models and linear statistical models. All of the above models are quite simple to
implement and enable quick calculation of predictions. Nevertheless, they normally do
not describe the interactions and non-linear relationships that govern the transport and
behaviour of pollutants in air. With these limitations, machine learning approaches
like outlier detection have become favoured in air quality prediction and other
environmental problems [20]. Outlier detection has been used for ages to find and,
where possible, eliminate ambiguous outputs from data. Outliers emerge because of
mechanical defects, changes in system behaviour, fraudulent behaviour, human
mistakes, instrument errors or simply genuine deviations in a population. Their
detection can identify glitches before they intensify, with potentially disastrous
results. It helps to find errors and reduce their contaminating impact on the complete
set of data, which is thus purified for further processing. The technique of outlier
detection has applications in many fields, some of which are fraud detection, intrusion
detection for cyber-security, insurance, etc. [1].

Since outliers are proven to be one of the major factors affecting data quality, we offer
a complete summary of the research done in the field of outlier detection for the
prediction of environmental pollution, assess and contrast present outlier detection
methodologies mainly designed for air quality prediction, and identify possible areas
for further analysis [2].

1.2 Motivation
Atmospheric pollution is a major problem all around the world, and recent statistics
from India, China and other countries show the permissible limits being surpassed.

1.2.1 Case Studies:

a) New Delhi: Air pollution in Delhi has worsened. The toxic smog continued to
envelop the national capital, with reduced visibility in some areas, as can be
seen in Figure 1.1 [6].

Figure 1.1 Smog in Delhi

b) Haryana: The murky air, with little wind to disperse it, partially originates
from the yearly post-harvest burning of crop straw in Punjab and Haryana,
as shown in Figure 1.2 below. The extent of hazardous pollutants in the
atmosphere has increased. Burning of leftover crop residue, or Crop
Residue Burning (CRB), has been identified as a prominent health threat [7].

Figure 1.2 Harvest burning in Haryana

1.3 Literature Review
The following Table 1.1 shows a comparative study of the various research works,
done from 2009 to 2018, that are related to our topic of research. It clearly mentions
the findings and limitations of each work.

Table 1.1 Comparative study of the related research work

Study and Year: Air pollution forecasting in Ankara, Turkey using air pollution index and its relation to assimilative capacity of the atmosphere, 2009
Findings: The main goal of this study was to provide data on the contribution of traffic emissions at residential sites and to assess the relative contributions of emissions and meteorology to the measured pollution concentrations. The air pollution index (API) for the city was determined and utilized, along with meteorological parameters, to forecast the following day's API. The determined API was also compared with the hourly ventilation coefficient, which is an indicator of the dispersion capability of the atmosphere [23].
Disadvantages: The AQI sometimes failed to record pollution levels because of a lack of sufficient data on air pollution.

Study and Year: A Wireless Sensor Network Air Pollution Monitoring System, 2010
Findings: This study proposed an innovative framework named Wireless Sensor Network Air Pollution Monitoring System (WAPMS) to monitor air pollution in Mauritius using wireless sensors deployed in large numbers around the island. In order to improve the efficiency of WAPMS, the authors designed and implemented a new data aggregation algorithm named Recursive Converging Quartiles (RCQ). The algorithm is used to merge data in order to eliminate duplicates, filter out invalid readings and summarise them into a simpler form, which significantly reduces the amount of data to be transmitted to the sink and consequently saves energy [24].
Disadvantages: This method has limited storage and computation capabilities.

Study and Year: NeuroFuzzy Modeling Scheme for the Prediction of Air Pollution, 2010
Findings: The principal goal of this paper is two-fold. The first is to create a fuzzy controller scheme for the prediction of the variation of NO2 or SO2 over urban zones based on the measurement of NO2 or SO2 over defined industrial sources. The second is to build a neural network (NN) scheme for the prediction of O3 based on NO2 and SO2 measurements [25].
Disadvantages: The neural network modelling scheme only provides a computational tool for mapping input-output or cause-effect relationships.

Study and Year: Real-time Air Quality Monitoring Through Mobile Sensing in Metropolitan Areas, 2013
Findings: This paper presents a vehicular-based mobile approach for measuring fine-grained air quality in real time. It proposes two cost effective data gathering models: one that can be deployed on public transportation and a second that is an individual sensing device [26].
Disadvantages: It depends on a consistent, reliable stream of pollution measurements. Additionally, it depends on the participants to produce data.

Study and Year: Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation, 2015
Findings: In this paper a novel hybrid model combining air mass trajectory analysis and wavelet transformation is used to improve the artificial neural network (ANN) forecast accuracy of daily average concentrations of PM2.5 two days ahead of time [27].
Disadvantages: Due to the severe air pollution, the overall accuracy of the prediction is not very satisfactory.

Study and Year: A Machine Learning Approach for Air Quality Prediction: Model Regularization and Optimization, 2018
Findings: This research developed an effective machine learning method for air pollutant prediction. The authors formulated the issue as regularized multi-task learning (MTL) and utilized advanced optimization algorithms. They propose refined models to forecast the hourly air pollution concentration on the basis of meteorological data of previous days, by formulating the prediction over 24 hours as a multi-task learning (MTL) problem [28].
Disadvantages: This approach needs a lot of training data for future prediction.

Study and Year: Outlier Detection in Urban Air Quality Sensor Networks, 2018
Findings: This research developed a novel outlier detection strategy based on a spatio-temporal classification, concentrating on hourly NO2 concentrations. It partitions a whole year's observations into 16 spatio-temporal classes. For each spatio-temporal class, outliers are identified using the mean and standard deviation of the normal distribution underlying the truncated normal distribution of the NO2 observations [29].
Disadvantages: Expert knowledge is required to assess every outlier and decide on its treatment.

Study and Year: Sensor-Based Optimization Model for Air Quality Improvement in Home IoT, 2018
Findings: The research covers sensor-based home IoT, studies on indoor air quality, and technical investigations on random data generation. It also builds an air quality improvement model that can be readily applied to the market. Accordingly, related data are generated based on user behaviour values, and the logic is integrated into the existing home IoT framework to enable users to easily access the system through the Web or mobile applications [30].
Disadvantages: There is no global standard of similarity for the labelling and checking equipment. Also, as with every complex system, there are more chances of failure.

Study and Year: Air quality forecasting based on cloud model granulation, 2018
Findings: This is the first attempt to forecast AQIs by cloud model granulation. This strategy transforms the problems in the original data space to the feature space and from the feature space to the concept space. Finally, the solution to the original problem is obtained by continuous data reduction [31].
Disadvantages: The experiments use free open data; if more chemical and meteorological measurements were obtained, the analysis might give more accurate prediction results. Second, the missing data were replaced with the mean or mode, which may lead to biased results.

1.4 Research Gaps


After extensive literature review, we were able to identify some marked gaps in the
approaches being followed.

a) Similarity is a crucial component in clustering based data mining approaches.


The approaches that have mostly been followed involve finding either object
similarity or attribute similarity. The problem of finding similarity or
dissimilarity of two subsets of data or of two time series out of a time series
database has not received enough attention. One possible application area of such a
dissimilarity measure can be the clustering of patients based on their data
which can be thought of as time series of their medical observations. A
properly designed dissimilarity measure that can compare two time series or
two subsets of a given time series can not only help in identifying an
anomalous time series but can also help in identifying collective anomalies in
a given time series.
b) The effectiveness of the outlier detection methods rely majorly on the ability
of the underlying clustering method. The ability of clustering method in turn
depends on the effective and efficient choice of parameters like initial
centroids and number of clusters etc. Although nature inspired metaheuristic
algorithms have shown a lot of promise as optimizers for various processes,
this has not been adequately explored in the existing approaches to improve
effectiveness of clustering algorithms.
c) Most of the existing outlier detection methodologies label the data as normal
and anomalous, which is fine if the stage succeeding the outlier detection is a
classification or alert generation stage. But if the results of the outlier
detection need to be processed further, some outlier score needs to be

generated which should closely relate to the “unusualness” of the specific data
instances. For instance, when using nearest neighbour based clustering, an outlier
score can be established as the number of neighbours for a specific parameter 'n'.
The outlier scores that have been suggested in existing methods are very
trivial. A well-defined outlier score can also be used as an indicator for
proactive alerting and may reduce the time lag between occurrence of outlier
and its detection thereby increasing the efficiency of the system. Moreover the
problem of increased false positives upon changes in working conditions can
be simplified to a tuning of threshold outlier scores.
d) The complexity of the process of outlier detection in time series data is
affected to a great extent by the curse of dimensionality. The higher the
dimensionality of the data series, the more time, effort and power needed to
analyse the data so as to be able to find out either the anomalies in a time
series or a subsequence out of a time series database. Current approaches have
the limitations that they are not able to identify the dimensions that effectively
represent the data series and need to be preserved.
e) Outlier detection methods are affected largely by the problem of False
positives. This not only reduces the efficiency of detecting the actual outliers
but also has an adverse effect on the effectiveness and acceptability of the
detection methodologies. The existing methods lack in both sensitivity and
specificity. There are some existing ways to improve precision by using
methods like generalization but these methods have the downside that they
may cause loss of accuracy due to over generalization. An optimal
methodology must increase the accuracy and precision in a way that can be
measured tangibly.
f) Most of the outlier detection methods in use today are not designed to handle
the huge volumes of data that are generated today. In an era of Big-Data, the
currently existing methods of outlier detection are finding themselves
inadequate. We need to develop an approach that is not only effective and
efficient with existing data, but can also be successfully scaled for handling
big data.

1.5 Research Objectives
Based on the various research gaps identified in the literature survey phase, this
research strives to overcome at least some of the gaps thereby trying to develop an
effective and efficient methodology for detecting outliers.

a) The proposed research intends to find out the most efficient distance method
which can be used in the application of k-means algorithm.
b) The proposed research also intends to apply an appropriate optimization
methodology so as to improve the performance of the data mining algorithm
that is forming the basis for the outlier detection process.
c) The proposed research also endeavours to identify an efficient method to
distinguish those attributes that are representative of causing the major
problem of air pollution. We would attempt to develop an optimum scalable
methodology that can cope with the large amounts of data being generated
today.

(For Planning Of Work Please Refer Appendix A)

Chapter 2
Clustering and Metaheuristics

2.1 Outliers and Outlier detection


Outlier detection is the process of figuring out the data points that do not conform to
the expected behaviour. These non-conforming patterns are usually called anomalies,
novelties, outliers or exceptions. Anomalies and outliers are the two most
commonly used terms [8].

2.1.1 Types of Anomalies


a) Point Anomaly: There can be a situation where a discrete data point is
anomalous in comparison to the rest of the data. In this situation these discrete
points are known as point anomalies, as shown in Figure 2.1 [9].

b) Context Anomaly: There can be a situation where a data point is an
outlier only in a specific scenario. This situation is referred to as a contextual
anomaly, as shown in Figure 2.2 [9].

c) Collective Anomaly: There can be a situation where a group of related data
points is different with respect to the whole dataset. This situation is known as
a collective anomaly, as shown in Figure 2.3 [9].

Figure 2.1 Point Anomaly
Figure 2.2 Context Anomaly

Figure 2.3 Collective Anomaly

2.2 Clustering
Clustering is one of the significant data analysis procedures, utilized to investigate and
comprehend huge collections of data. Clustering has demonstrated its applicability in
different fields like bioinformatics, pattern recognition, image processing, medical
mining and many more. The purpose of clustering is to organize objects into clusters
based on the values of their attributes. Recently, numerous researchers have shown a
noteworthy interest in developing clustering algorithms [10]. The issue in clustering is
that we do not have prior knowledge about the given dataset. Also, the choice of input
parameters such as the number of clusters, the number of nearest neighbours, the
stopping criteria and other factors in these algorithms makes clustering increasingly
challenging. These algorithms also suffer from poor accuracy when the dataset
contains clusters with various complex shapes, densities and anomalies. Furthermore,
nature based algorithms have also been developed to find the exact solution of
clustering problems. These algorithms provide better quality results in comparison to
traditional algorithms [11].

2.2.1 K-Means Flowchart


The k-means algorithm is one of the most broadly utilized unsupervised clustering
algorithms. The algorithm follows basic steps to discover the clusters in a
given dataset, as shown in Figure 2.4. It utilizes the idea of centroids and
subsequently assigns the data points to the most suitable cluster based on a
distance technique such as Euclidean, Manhattan and so forth. The points
which do not lie in any of the clusters are the anomalies for that dataset [12].

In this algorithm the k centroids change their locations step by step until no more
changes are made, or in other words until the centroids do not move any more.

The formula to calculate the new cluster centre is:

c_i = \frac{1}{n_i} \sum_{x_j \in C_i} x_j

where

n_i is the number of data points in the ith cluster C_i, and

x_j is a data point of that cluster.

Figure 2.4 Flow chart of k-means algorithm (Start → choose the number of clusters K → determine the centroids → determine the distance of the objects from the centroids → group the points on the basis of the minimum distance → repeat until no object moves → End)
In the k-means algorithm we initially need to choose the number of clusters. Then we
take the centroid for each cluster randomly from the dataset. After finding the
centroids, all the data points are allotted to one of the clusters on the basis of the
minimum distance. Then the mean of each cluster is calculated and this mean value is
taken as the updated centroid value. The complete process continues until we get the
same clusters in two iterations [13].

2.3 Principal Component Analysis

PCA is a statistical procedure that uses an orthogonal transformation to convert a set
of observations of possibly correlated variables into a set of values of linearly
uncorrelated variables called principal components. It is a technique for summarizing
data. It is basically used to reduce the dimensions of a large dataset. It is also used to
check whether the variables are statistically independent of each other [5].

2.3.1 Working of PCA

a) Compute the covariance matrix of the data points X.

b) Compute the eigenvectors and the related eigenvalues.
c) Sort the eigenvectors according to their eigenvalues in decreasing order.
d) Pick the first k eigenvectors; these will be the new k dimensions.
e) Transform the original n-dimensional data points into k dimensions.

The covariance matrix used in PCA is symmetric. The resulting dimensions are linear
and independent in nature and they can losslessly represent the data points. The new
dimensions enable us to reproduce the original dimensions, and the reconstruction
error should be minimised [32].
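A small Python sketch of these steps using NumPy's eigendecomposition is shown below; it is a simplified illustration, not the project's exact implementation.

import numpy as np

def pca(X, k):
    # Centre the data and compute the (symmetric) covariance matrix
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    # Eigenvectors and eigenvalues of the covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Sort the eigenvectors by their eigenvalues in decreasing order
    order = np.argsort(eigvals)[::-1]
    components = eigvecs[:, order[:k]]
    # Project the original n-dimensional points onto the first k components
    return Xc @ components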

2.4 Metaheuristics
Metaheuristics are algorithms motivated by nature and utilized for optimization
problems to get the best expected result. A metaheuristic is a general algorithmic
structure that can be applied to various optimization problems with relatively few
changes to adapt it to a particular problem. Examples of metaheuristics include
simulated annealing (SA), tabu search (TS), iterated local search (ILS), evolutionary
algorithms (EA), ant colony optimization (ACO) and particle swarm optimization
(PSO). The objective of metaheuristics is to effectively explore the search space so as
to discover near-optimal solutions [33].

2.4.1 Particle Swarm Optimization


Particle swarm optimization (PSO) is a computational method that optimizes
a problem by iteratively trying to improve a candidate solution with
respect to a given measure of quality. The algorithm is inspired by the idea
of a swarm searching for food.

Assume there is a swarm (a flock of birds). Every one of the birds is hungry and is
searching for food. In the region of these birds there is just a single food particle.
This food particle can correspond to a resource: as we are aware, tasks are many and
resources are limited, so this becomes a situation comparable to a typical optimization
problem. Now, the birds do not know where the food particle is hidden or located. In
such a situation, how should the algorithm to discover the food particle be designed?
If each bird tries to find the food all on its own, it may cause chaos and may consume
a lot of time. On careful observation of this swarm, it was realised that although the
birds do not know where the food particle is located, they do know their distance from
it. Hence the best way to find that food particle is to follow the birds which are nearest
to it. This behaviour of birds is simulated in a computational environment, and the
algorithm so designed is named the Particle Swarm Optimization algorithm
[34]. The same algorithm can be effectively utilized in computer science.
For example, in our research, we attempted to discover the major cause of
pollution using this algorithm.

2.4.1.1 PSO Flowchart


In the PSO algorithm, as shown in Figure 2.5, we first have to initialize the PSO
parameters. Then the first swarm is generated. After generating the swarm, the fitness
values of all the particles are calculated and evaluated, and the personal best fitness of
each particle is recorded. Then, from the fitness values, the global best particle is
selected. Next it is checked whether the swarm has met the termination criteria or not.
If not, the positions of the particles are updated and the fitness of all the particles is
calculated and evaluated again. If yes, the algorithm ends [14].
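The loop described above can be sketched in Python roughly as follows. This is a generic PSO for minimising a fitness function f over d dimensions; the inertia weight w and the acceleration coefficients c1 and c2 are common textbook values, not the project's tuned settings, and the termination criterion is simply a fixed number of iterations.

import numpy as np

def pso(f, d, n_particles=30, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    # Generate the first swarm: random positions and zero velocities
    pos = rng.uniform(-1, 1, (n_particles, d))
    vel = np.zeros((n_particles, d))
    pbest = pos.copy()                          # personal best positions
    pbest_val = np.array([f(p) for p in pos])   # personal best fitness values
    gbest = pbest[pbest_val.argmin()].copy()    # global best particle
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, d))
        # Update the velocities and positions of the particles
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        # Evaluate the fitness again and update the personal and global bests
        vals = np.array([f(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()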

Figure 2.5 Flow chart of PSO Algorithm (Start → initialize the PSO parameters → generate the first swarm → evaluate the fitness of all particles → record the personal best fitness of all particles → find the global best particle → if the swarm has not met the termination criteria, update the positions of the particles and repeat → End)
Chapter 3

Experiment Analysis

3.1 Model Architecture


The planning of our proposed model is given in Figure 3.1 below. The air quality
dataset is used as input to the k-means, PSO 1 and PSO 2 algorithms, which give the
name of the gas that is the major cause of pollution as output.

Figure 3.1 Architecture of the Proposed Model (Input: dataset → apply k-means → apply PCA → apply k-means using the most efficient distance method obtained from step 2 → apply PSO 1 → apply PSO 2 → Output: determine the gas which is most responsible for air pollution)
3.2 Solution Algorithm
Following are the steps for solution methodology:

Step 1 Define the input dataset

Step 2 Apply k-means to find the most efficient distance method

Step 3 Apply PCA to reduce the dimension of the dataset

Step 4 Apply k-means using the most efficient distance method obtained from step 2

Step 5 Apply PSO 1 to get the optimized and better results

Step 6 Apply PSO 2

Step 7 Determine the gas which is the major cause of pollution

Each step of the proposed methodology is explained in the following paragraphs.

The Air Quality dataset is used as input; it consists of data about the concentration of
the gases present in the air. The data used is multidimensional. The dataset has 13
columns (CO_GT, PT08_S1_CO, NMHC_GT, C6H6_GT, PT08_S2_NMHC, Nox_GT,
PT08_S3_Nox, NO2_GT, PT08_S4_NO2, PT08_S5_O3, T, RH, AH) and 9358 rows,
as shown in Table 3.1.

Table 3.1 Dataset

CO_GT  PT08_S1_CO  NMHC_GT  C6H6_GT  PT08_S2_NMHC  Nox_GT  PT08_S3_Nox  NO2_GT  PT08_S4_NO2  PT08_S5_O3  T     RH  AH
2.6    1360        150      11.9     1046          166     1056         113     1692         1268        13.6
2.0    1292        112      9.4      955           103     1174         92      1559         972         13.3
2.2    1402        88       9.0      939           131     1140         114     1555         1074        11.9
2.2    1376        80       9.2      948           172     1092         122     1584         1203        11.0
1.6    1272        51       6.5      836           131     1205         116     1490         1110        11.2
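For illustration, the dataset can be loaded roughly as shown below. The file name AirQuality.csv, the ';' separator, the ',' decimal mark and the -200 missing-value marker correspond to the publicly available UCI Air Quality data and are assumptions here; the actual input file used in the project is provided on the CD.

import pandas as pd

# Assumed file name and format; -200 is treated as a missing reading
df = pd.read_csv("AirQuality.csv", sep=";", decimal=",", na_values=-200)
df = df.dropna(axis=1, how="all")   # drop fully empty columns, if any
print(df.shape)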

The complete experiment is divided into two parts.

In the first part, k-means is applied with some modifications. K-means basically uses
the Euclidean distance method for calculating the separation between two data points,
and on that basis cluster formation or analysis is done. Here, k-means is modified:
rather than using the Euclidean distance directly, five different distance formulas are
used in order to find out the most efficient one. The different formulas used in this
research are as follows:

a) Euclidean Distance Metric

The equation for calculating this distance between two points X and Y is:

d = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}

The Euclidean distance function measures the 'as-the-crow-flies' distance
[16].

b) Manhattan

This distance method is used if a grid-like path is followed. The equation for
this distance between two points X and Y is:

d = \sum_{i=1}^{n} |X_i - Y_i|

where
n is the number of variables, and Xi and Yi are the values of the ith variable, at
points X and Y respectively [15].

c) Pearson Correlation

It calculates the similarity in shape between two profiles. The formula for this
distance is:

d = 1 - r

where

r = \frac{Z(x) \cdot Z(y)}{n}

is the dot product of the z-scores of the vectors x and y. The z-score of x is
constructed by subtracting its mean from x and dividing by the standard
deviation [15].

d) Chebyshev

The Chebyshev distance between two points is the maximum difference between the
points in any one dimension. The distance between two points X and Y is found
using the formula:

d = \max_i |X_i - Y_i|

where Xi and Yi are the values of the ith variable at points X and Y respectively
[16].

e) Spearman Rank Correlation

Spearman rank correlation finds the interdependency between two
sequences of values. The two sequences are ranked separately and the
differences in rank are calculated at every position, i.e. the correlation between
two sequences X = (X1, X2, etc.) and Y = (Y1, Y2, etc.) is calculated using the
following formula:

\rho = 1 - \frac{6 \sum_{i=1}^{n} (rank(X_i) - rank(Y_i))^2}{n(n^2 - 1)}

where Xi and Yi are the ith values of sequences X and Y respectively.

The value of the Spearman correlation lies in the range from -1 to 1. Spearman
correlation can detect certain linear and non-linear interdependencies [15].

The efficiency of these distance formulas is measured on the basis of the MSE and
RMSE values, which are the output of this experiment. The lower the RMSE value,
the more efficient the distance formula. After finding the distance formula that is the
most efficient among all, it is used for the further implementation of k-means and
cluster formation. This means that instead of using the Euclidean formula in k-means
for distance calculation, we will use the distance formula whose RMSE value comes
out to be the lowest.
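A compact Python sketch of the five distance measures described above is given below; it is illustrative only, and each function takes two NumPy vectors of equal length.

import numpy as np
from scipy.stats import rankdata

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))

def manhattan(x, y):
    return np.sum(np.abs(x - y))

def pearson(x, y):
    # d = 1 - r, where r is the dot product of the z-scores divided by n
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return 1 - np.dot(zx, zy) / len(x)

def chebyshev(x, y):
    return np.max(np.abs(x - y))

def spearman_rho(x, y):
    # Spearman rank correlation (in [-1, 1]); 1 - rho can be used as a distance
    rx, ry = rankdata(x), rankdata(y)
    n = len(x)
    return 1 - 6 * np.sum((rx - ry) ** 2) / (n * (n ** 2 - 1))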

In the second part, the main aim is to optimize the k-means algorithm using PSO; the
distance method used in k-means is the Euclidean distance (as obtained from step 2).
For this, we first dropped the null values from the dataset and normalized it in order to
reduce redundancy. Then we applied PCA to reduce the dimensions from 10 to 2
(as shown in Figure 3.2) and to find the correlation between variables. Now the
variables can be plotted along the x and y axes only and the clusters can be
made efficiently.
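A sketch of this preprocessing step using scikit-learn is shown below; the use of StandardScaler for the normalisation is an assumption made for illustration.

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

def preprocess(df):
    # Drop the rows with null values and normalise the remaining attributes
    data = df.dropna()
    scaled = StandardScaler().fit_transform(data)
    # Reduce the dimensions to 2 so that the points can be plotted on x and y axes
    reduced = PCA(n_components=2).fit_transform(scaled)
    return reduced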

Figure 3.2 The value of two dimensions (x,y) after using PCA

In the next step we applied the k-means clustering algorithm, where the number of
clusters is 3. In this algorithm we have used a boolean variable init_pp and two
functions, k-means and k-means 2. If the value of init_pp=True then k-means 2 is
called and the centroids of the clusters are selected randomly using the random
function, and if the value of init_pp=False then k-means is called and the centroids of
the clusters are selected randomly from the dataset. After finding the centroids we
have used the fit function, which assigns the different values of the data to different
clusters on the basis of the distance, and the centroid of each cluster is updated.
Finally, we have calculated the silhouette score, SSE and quantization error and
plotted the clusters for the same. From this we concluded that k-means 2 works better
than k-means, as can be seen from the values of the silhouette score, SSE and
quantization error.
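A condensed sketch of this part is given below. The variable init_pp and the evaluation measures follow the description above; the exact initialisation ranges and the use of scikit-learn's silhouette_score are assumptions made for illustration.

import numpy as np
from sklearn.metrics import silhouette_score

def init_centroids(X, k, init_pp, rng):
    if init_pp:
        # k-means 2: centroids generated with the random function over the data range
        return rng.uniform(X.min(axis=0), X.max(axis=0), (k, X.shape[1]))
    # k-means: centroids picked randomly from the dataset itself
    return X[rng.choice(len(X), k, replace=False)]

def fit(X, k=3, init_pp=False, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = init_centroids(X, k, init_pp, rng)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
        labels = d.argmin(axis=1)
        centroids = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])
    # Final assignment and the three evaluation measures
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    labels = d.argmin(axis=1)
    sse = float(np.sum(d.min(axis=1) ** 2))
    quantization_error = float(np.mean(d.min(axis=1)))
    sil = float(silhouette_score(X, labels))
    return labels, centroids, sil, sse, quantization_error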

Next we have applied the PSO algorithm using two different approaches, named
PSO 1 and PSO 2. In this algorithm we have used a boolean variable use_kmeans. If
the value of use_kmeans is True then the centroids obtained from k-means are
replaced by the centroids obtained from the PSO 1 algorithm, i.e. the end condition of
k-means becomes the start condition of PSO 1. If the value of use_kmeans is False
then the centroids obtained from k-means are optimized by using the PSO 2
algorithm, i.e. PSO 2 optimizes the centroid obtained at each step. Finally, the two
approaches are compared on the basis of their gbest score, which represents the
distance between the data points and the cluster heads. The lower the value of the
gbest score, the better the method of optimization [22].
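A rough sketch of how PSO can optimise the cluster centroids is given below. Each particle encodes all k centroids as one flattened vector, the fitness is the quantization error (whose best value corresponds to the gbest score), and use_kmeans decides whether the swarm is seeded with the k-means centroids. This is one possible reading of the description above, not the exact project code.

import numpy as np

def quantization_error(X, centroids):
    # Mean distance of each point to its nearest centroid (used as the gbest score)
    d = np.linalg.norm(X[:, None] - centroids[None, :], axis=2)
    return d.min(axis=1).mean()

def pso_clustering(X, k, kmeans_centroids, use_kmeans=True, n_particles=20,
                   iters=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    dim = k * X.shape[1]
    # Each particle is a flattened set of k candidate centroids
    pos = X[rng.choice(len(X), (n_particles, k))].reshape(n_particles, dim)
    if use_kmeans:
        pos[0] = kmeans_centroids.reshape(dim)   # seed one particle with the k-means result
    vel = np.zeros_like(pos)
    fitness = lambda p: quantization_error(X, p.reshape(k, -1))
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        vals = np.array([fitness(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest.reshape(k, -1), float(pbest_val.min())   # optimised centroids, gbest score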

Now, to predict the major gas responsible for air pollution, we select a cluster and
find the gases present in that cluster. After that we find the mean of each gas
individually and plot a bar graph showing the mean concentration of each gas.
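This last step can be sketched as follows, assuming df holds the original gas columns and labels holds the cluster assignments obtained in the previous step.

import matplotlib.pyplot as plt

def plot_cluster_means(df, labels, cluster_id=0):
    # Select the rows belonging to one cluster and average each gas column
    cluster = df[labels == cluster_id]
    means = cluster.mean(numeric_only=True).sort_values(ascending=False)
    # Bar graph of the mean concentration of each gas in the selected cluster
    means.plot(kind="bar")
    plt.ylabel("Mean concentration")
    plt.tight_layout()
    plt.show()
    return means.idxmax()   # the gas with the highest mean concentration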

Chapter 4
Result and Discussion

The result of step 2 of the proposed solution algorithm is that the RMSE value of the
Euclidean distance is the lowest, as can be seen from the outputs attached below. This
implies that in the further implementation the Euclidean distance will be used in
k-means for cluster formation. Figures 4.1-4.5 show the RMSE and MSE values of the
various distance methods used.

Figure 4.1 RMSE and MSE value of Euclidean distance

Figure 4.2 RMSE and MSE value of Spearman distance

Figure 4.3 RMSE and MSE value of Manhattan distance

Figure 4.4 RMSE and MSE value of Pearson distance

Figure 4.5 RMSE and MSE value of Chebyshev distance

The result of using PCA is the correlation matrix as can be seen in Figure 4.6.

Figure 4.6 Correlation between gases

The scale of the correlation is such that light colours represent a higher correlation
between two variables whereas dark colours represent a lower correlation
between two variables. A high correlation value means that the two gases will be
present together in every situation.
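For reference, a correlation matrix like the one in Figure 4.6 can be produced with a few lines of Python; the colour map chosen here (lighter cells for higher values) is an assumption for illustration.

import matplotlib.pyplot as plt

def plot_correlation(df):
    corr = df.corr()                   # pairwise correlation between the gas columns
    plt.imshow(corr, cmap="viridis")   # lighter cells indicate a higher correlation
    plt.colorbar()
    plt.xticks(range(len(corr)), corr.columns, rotation=90)
    plt.yticks(range(len(corr)), corr.columns)
    plt.tight_layout()
    plt.show()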

Further, on applying the k-means algorithm we get the clusters formed as shown in
Figure 4.7. The black points in the image show the cluster heads. Here, the centroids
of the clusters are chosen randomly from the dataset. The data points that do not lie
near any cluster are the anomalies in the dataset.

Figure 4.7 Cluster formation using k-means

Next, on applying the k-means 2 algorithm we get the clusters formed as shown in
Figure 4.8 below. The black dots in the image below show the cluster heads. Here, the
centroids of the clusters are selected randomly using the random function. The
data points that do not lie near any cluster are the anomalies in the dataset.

Figure 4.8 Cluster formation using k-means 2

The above two approaches can be compared using the silhouette score, SSE and
quantization error. From Figure 4.9 and Figure 4.10 below it can be concluded that
k-means 2 is better than k-means, as the lower the value of the SSE, the better the method.

Figure 4.9 Silhouette, SSE, quantization error values of k-means
Figure 4.10 Silhouette, SSE, quantization error values of k-means 2

Now, since in PSO 1 the centroids obtained from k-means are replaced by the
centroids obtained from the PSO 1 algorithm, the clusters formed by PSO 1 are shown
in Figure 4.11 below. Also, in PSO 2 the centroids obtained from k-means are
optimized by using the PSO 2 algorithm, and the clusters formed by PSO 2 are shown
in Figure 4.12 below.

The above two PSO approaches can be compared using the gbest score, which
represents the distance between the data points and the cluster heads. The lower the
value of the gbest score, the better the method of optimization.

Figure 4.11 Cluster Formation using PSO 1

Figure 4.12 Cluster Formation using PSO 2

Figure 4.13 Gbest score of PSO 1

Figure 4.14 Gbest score of PSO 2

From Figure 4.13 and Figure 4.14 it is concluded that PSO 1 is better than PSO 2 in
our case, as the gbest score of PSO 1 is less than that of PSO 2.

Now, the bar graph below shows the concentration levels of the various gases present
in the air. From this graph we can conclude that the gas PT08_S5_O3 is the major
cause of air pollution, as it is present in the atmosphere in the highest concentration,
thus degrading the quality of the air.

Figure 4.15 Graph showing the concentration of gases in air

Chapter 5
Conclusion

The proposed research aims to find out the gas which is most responsible for air
pollution. We started towards this aim by first implementing the k-means algorithm
using five different distance methods. We compared the distance methods on the basis
of their RMSE values. The lower the RMSE value, the better the distance method.
From this we concluded that the Euclidean distance is the best, as its RMSE value is
the lowest among all. Then we used PCA to reduce the dimensions of our dataset from
10 to 2. This reduced dataset is further used to implement the k-means and PSO
algorithms. We used two approaches to implement k-means in our solution. In the
first approach i.e. k-means, the centroids are taken randomly from the dataset whereas
in the second approach i.e. k-means 2, the centroids are taken randomly using the
random function. Using k-means and k-means 2 we plotted the graphs showing the
cluster formation which thus helps to determine the outliers. Also, the two approaches
were compared on the basis of their silhouette score, SSE and quantization error
values, and the conclusion was made that k-means 2 is better than k-means because
the values of the silhouette score, SSE and quantization error are lower in the case of
k-means 2. Further we
optimized our results using the PSO algorithm. PSO algorithm is also implemented
using two different approaches i.e. PSO 1 and PSO 2. In PSO 1, the centroids
obtained from k-means are replaced by the centroids obtained from the PSO 1
algorithm whereas in PSO 2 the centroids obtained from k-means are optimized by
using PSO 2 algorithm. Both the PSO approaches were also compared on the basis of
their gbest score. The lower the gbest score, the better the optimization algorithm.
Here the gbest score of PSO 1 is less than that of PSO 2, so in our case PSO 1 is better
than PSO 2.

Using the above solution methodology, the gas PT08_S5_O3 is found to be the most
responsible for air pollution, as the concentration of this gas is the highest among all
the gases, hence fulfilling the objective of this research. Now that we know this cause
of air pollution, we can take appropriate actions to minimize environmental pollution,
thus improving the quality of the air and human health.

References

[1] Air Quality Prediction by Machine Learning Method by Huiping Peng B.Sc., Sun
Yat-sen University, 2013

[2] 2017_5$thumbimg112_May_2017_070949977.pdf

[3] Particle Swarm Optimization, https://en.wikipedia.org/wiki/Particleswarmoptimizatin, accessed on 9 May 2019 at 5:30 PM

[4] Saptarshi Sengupta *, Sanchita Basak and Richard Alan Peters II, “Particle Swarm
Optimization: A survey of historical and recent developments with hybridization
perspectives”

[5] Principal Component Analysis, https://en.wikipedia.org/wiki/Principal_component_analysis, accessed on 9 May 2019 at 6 PM

[6] Smog continues to envelope Delhi & NCR, http://ddnews.gov.in/national/smog-continues-envelope-delhi-ncr, accessed on 9 May 2019 at 6:15 PM

[7] Crop Burning: Punjab and Haryana's killer fields, https://www.downtoearth.org.in/news/air/crop-burning-punjab-haryana-s-killer-fields-55960, accessed on 9 May 2019 at 6:15 PM

[8] A Brief Overview of Outlier Detection Techniques


https://towardsdatascience.com/a-brief-overview-of-outlier-detection-techniques-
1e0b2c19e561 accessed on 9 May 2019 at 7 PM
[9] Outlier Analysis: A Quick Guide to the Different Types of Outliers
https://www.anodot.com/blog/quick-guide-different-types-outliers/, accessed on 9
May 2019 at 7:15 PM
[10] A Tutorial on Clustering Algorithms
https://home.deib.polimi.it/matteucc/Clustering/tutorial_html/ accessed on 9 May
2019 at 8 PM.
[11] Cluster analysis https://en.m.wikipedia.org/wiki/Cluster_analysis accessed on 9
May 2019 at 8:10 PM

[12]k-means clustering algorithm
https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm
accessed on 10 May 2019 at 12:30 PM
[13] Understanding K-means Clustering in Machine Learning
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-
learning-6a6e67336aa1, accessed on 10 May 2019 at 2 PM
[14] Implementing the Particle Swarm Optimization (PSO) Algorithm in Python
https://medium.com/analytics-vidhya/implementing-particle-swarm-optimization-pso-
algorithm-in-python-9efc2eb179a6, accessed on 10 May 2019 at 5 PM
[15]Methods for measuring distances
http://www.sthda.com/english/wiki/print.php?id=235, accessed on 10 May 2019 at
5:25 PM
[16] Distance Metrics Overview
http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Param
eters/Distance_Metrics_Overview.htm, accessed on 10 May 2019 at 5:30 PM

[17] Environmental Pollution https://www.sciencedirect.com/topics/earth-and-


planetary-sciences/environmental-pollution, accessed on 10 May 2019 at 6:30 PM

[18] Francisc Popescu and Ioana Ionel, “Anthropogenic Air Pollution Sources”,
August 18th 2010

[19] Global Warming Pollution & Climate Change


https://www.nrcm.org/projects/climate/global-warming-air-pollution/, accessed on 12
May 2019 at 7 PM

[20] A Machine Learning Approach for Air Quality Prediction


https://res.mdpi.com/BDCC/BDCC-02-00005/article_deploy/BDCC-02-00005-
v2.pdf?filename=&attachment=1, accessed on 12 May 2019 at 7:36 PM

[21] Air Pollution https://www.conserve-energy-future.com/causes-effects-solutions-


of-air-pollution.php, accessed on 10 May 2019 at 8:20 PM

[22]Particle Swarm Optimization: what means gBest value?


https://stackoverflow.com/questions/35043560/particle-swarm-optimization-what-
means-gbest-value, accessed on 10 May 2019 at 6:30

[23] D. Deniz Genc, Canan Yesilyurt, Gurdal Tuncel, "Air pollution forecasting in Ankara, Turkey using air pollution index and its relation to assimilative capacity of the atmosphere, 2009", Environmental Monitoring and Assessment, July 2010, Volume 166, Issue 1–4, pp 11–27.

[24] Kavi K. Khedo, Rajiv Perseedoss, Avinash Mungur, University of Mauritius, Mauritius, "A Wireless Sensor Network Air Pollution Monitoring System, 2010".

[25] Tharwat E. Alhanafy, Fareed Zaghlool and Abdou Saad El Din Moustafa, Computer and System Engineering Department, Al-Azhar University, Cairo, Egypt, and Arab Co. for Engineering & Systems Consultations (AEC), "Neuro Fuzzy Modeling Scheme for the Prediction of Air Pollution, 2010".

[26] Srinivas Devarakonda, Parveen Sevusu, Hongzhang Liu, Ruilin Liu, Liviu Iftode, Badri Nath, Department of Computer Science, Rutgers University, Piscataway, NJ 08854-8091, "Real-time Air Quality Monitoring Through Mobile Sensing in Metropolitan Areas, 2013".

[27] Xiao Feng, Qi Li, Yajie Zhu, Junxiong Hou, Lingyan Jin, Jingjie Wang, Institute of Remote Sensing and GIS, Peking University, Beijing 100871, China, "Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation, 2015".

[28] Dixian Zhu, Changjie Cai, Tianbao Yang and Xun Zhou, "A Machine Learning Approach for Air Quality Prediction: Model Regularization and Optimization, 2018".

[29] V. M. van Zoest, A. Stein and G. Hoek, "Outlier Detection in Urban Air Quality Sensor Networks, 2018".

[30] Jonghyuk Kim and Hyunwoo Hwangbo, "Sensor-Based Optimization Model for Air Quality Improvement in Home IoT, 2018".

[31] Yi Lin, Long Zhao, Haiyan Li and Yu Sun, "Air quality forecasting based on cloud model granulation, 2018".

[32] Principal Component Analysis https://medium.com/@aptrishu/understanding-


principle-component-analysis-e32be0253ef0 , accessed on 9 May 2019 at 4:25 PM

[33] Metaheuristics https://en.wikipedia.org/wiki/Metaheuristic , accessed on 9 May


2019 at 4 PM

[34] Particle Swarm Optimization https://www.geeksforgeeks.org/introduction-to-


particle-swarm-optimizationpso/ , accessed on 9 May 2019 at 5:35 PM

[35] Ahmed A. A. Esmin, Stan Matwin, "HPSOM: A HYBRID PARTICLE SWARM


OPTIMIZATION ALGORITHM WITH GENETIC MUTATION," International
Journal of Innovative Computing, Information and Control Volume 9, Number 5,
May 2013

Appendix A
Planning of Work

Bar chart showing the number of months planned for each phase of the work: Literature Survey, Determining Objectives, Implementation and Paper Writing.
Appendix B
Code
(Available in CD with name pollution_code.zip)

