BACHELOR OF TECHNOLOGY
IN
INFORMATION TECHNOLOGY
Submitted by
Supervised by
We declare that
a. the work contained in this report is original and has been done by us under the
guidance of our supervisor.
b. the work has not been submitted to any other institute for any degree or
diploma.
c. we have followed the guidelines provided by the institute to prepare the report.
d. we have conformed to the norms and guidelines given in the ethical code of
conduct of the institute.
e. wherever we have used materials (data, theoretical analysis, figures and text)
from other sources, we have given due credit to them by citing them in the text
of the report and giving their details in the references.
CERTIFICATE
Signature of Supervisor:
Date:
LIST OF FIGURES
LIST OF TABLES
LIST OF ACRONYMS
ABSTRACT
Air pollution is increasing day by day, harming the world economy, degrading the
quality of life and causing major productivity losses. It is currently one of the
most critical problems, with a significant impact on human health and the
ecosystem. Reliable air quality prediction can reduce the impact of pollution on
the nearby population and ecosystem; hence improving air quality prediction is a
prime objective for society. Clustering is a data analysis procedure used
to inspect and interpret vast collections of data. Outlier detection is used to find
anomalies in the data that do not conform to expected behaviour. It separates
data points into two types: normal points and outliers
(points that behave differently from normal points). By identifying these data
points we can find the major cause of a problem. The effectiveness of outlier
detection methods depends to a large extent on the ability of the underlying
clustering method. The ability of clustering methods in turn depends on the effective
and efficient choice of parameters such as the initial centroids and the number of
clusters. Although nature-inspired metaheuristic algorithms have shown a lot of promise as
optimizers for various processes, they have not been adequately explored in existing
approaches to improving the effectiveness of clustering algorithms. Clustering methods
suffer from inaccuracy when the dataset consists of groups with various complex
shapes and outliers. Nature-inspired algorithms have also been developed
to find the exact solution of clustering problems, and they provide better
quality results than traditional algorithms. To this end, the Particle Swarm
Optimization (PSO) technique can be used to optimize the clusters formed,
thus fulfilling the goal of this research in an effective way.
CONTENTS
Page No.
DECLARATION i
CERTIFICATE ii
LIST OF FIGURES iii
LIST OF TABLES iv
LIST OF ACRONYMS v
ABSTRACT vi
CHAPTER 1: INTRODUCTION 1
1.1 Background 1
1.2 Motivation 2
1.2.1 Case Studies 3
1.3 Literature Review 4
1.4 Research Gaps 9
1.5 Research Objectives 11
CHAPTER 2: CLUSTERING AND METAHEURISTICS 12
2.1 Outliers and Outlier Detection 12
2.1.1 Types of Anomalies 12
2.2 Clustering 13
2.2.1 K-Means Flowchart 13
2.3 Principal Component Analysis 15
2.3.1 Working of PCA 15
2.4 Metaheuristics 15
2.4.1 Particle Swarm Optimization 16
CHAPTER 3: EXPERIMENT ANALYSIS 18
3.1 Model Architecture 18
3.2 Solution Algorithm 19
CHAPTER 4: RESULT AND DISCUSSION 24
CHAPTER 5: CONCLUSION 29
REFERENCES 30
APPENDIX A: PLANNING OF WORK 34
APPENDIX B: CODE 35
Chapter 1
Introduction
1.1 Background
With economic and social development across the planet, the world is
experiencing increased concentrations of air pollutants. Environmental pollution is
currently among the most severe problems, with a major influence on human health
and the ecosystem. Governments have made various efforts to control
pollution, and much success has been achieved. Air pollutants are
emitted mostly by industrial factories and vehicles, owing to the increasing use of
hydrocarbons, other petrochemicals and fossil fuels [21]. Human health problems
are one of the most important consequences of air pollution, particularly in urban areas.
Global warming from anthropogenic greenhouse gas emissions is a long-term
consequence of air pollution. Accurate air quality prediction can reduce the effect of a pollution
peak on the surrounding population and ecosystem; hence improving air quality prediction is
a very important goal for society [19].
An air pollutant is a substance in the atmosphere that can have severe
impacts on society and on the whole environment. The substance can be
solid particles, liquid droplets or gases. Many activities and
factors are responsible for the emission of pollutants into the
environment. The sources of pollutants fall into two main classes:
anthropogenic (human-made) sources and natural sources. Anthropogenic sources are
largely related to the burning of various kinds of fuel, such as smoke stacks of fossil
fuel power stations, motor vehicles, marine vessels, aircraft, nuclear weapons and toxic
gases. Natural sources include dust, smoke, volcanic activity and ash particulates
[18].
Environmental pollution costs the global economy $5 trillion every year as a result
of productivity losses and a deteriorating standard of living. These productivity
losses are caused by deaths due to diseases caused by environmental
pollution. One out of ten deaths is caused by disease related to atmospheric
pollution, and the situation is worsening. The problem is
growing in developing societies [17].
Page 1 of 35
The Air Quality Health Index (AQHI) is a public information tool developed to
help people understand the impact of air quality on health. The AQHI is
a ranking scale ranging from 1 to 10+, based on mortality studies,
that indicates the level of health risk associated with local air quality. The
higher the number, the greater the effect on health and the greater the need to take
precautionary action [1].
Most air quality prediction to date uses simple approaches such as box models, Gaussian
models and linear statistical models. These models are easy to
implement and allow predictions to be computed quickly. However, they
generally do not describe the interactions and non-linear relationships that govern the
transport and behaviour of pollutants in air. Given these challenges, machine learning
approaches such as outlier detection have become popular in air quality prediction and
other environmental problems [20]. Outlier detection has long been used to find
and, where possible, eliminate anomalous observations from data. Outliers arise because
of mechanical faults, changes in system behaviour, fraudulent behaviour, human
error, instrument error or simply through genuine deviations in the population.
Detecting them can identify faults before they escalate with potentially
disastrous consequences. It also makes it possible to find errors and reduce their contaminating
effect on the complete data set, so that the data are cleaned for further
processing. Outlier detection is applied in many fields, including
fraud detection, intrusion detection for cyber-security and insurance [1].
Because outliers are one of the main factors affecting data
quality, we provide a summary of the research done on
outlier detection in the prediction of environmental pollution, assess and contrast
existing outlier detection methodologies designed mainly for air quality prediction,
and identify possible areas for further analysis [2].
1.2 Motivation
Atmospheric pollution is a major problem all around the world, and current statistics
for India, China and other countries show the permitted limits being exceeded.
1.2.1 Case Studies:
a) New Delhi: Air pollution in Delhi has worsened. Toxic smog continued to
envelop the national capital, with reduced visibility in some areas, as can
be seen in Figure 1.1 [6].
b) Haryana: The polluted air, with little wind to disperse it, partially originates
from the yearly post-harvest burning of crop straw in Punjab and Haryana,
as shown in Figure 1.2. The level of hazardous pollutants in the
atmosphere has increased. Burning of agricultural residue, or Crop
Residue Burning (CRB), has been identified as a prominent health threat [7].
1.3 Literature Review
Table 1.1 shows a comparative study of research work
done from 2009 to 2018 that is related to our research topic. It
states the findings and limitations of each work.
Title and year: [...] Monitoring System, 2010
Findings: [...] named Wireless Sensor Network Air Pollution Monitoring System (WAPMS) to monitor air pollution in Mauritius using wireless sensors deployed in large numbers around the island. To improve the efficiency of WAPMS, the authors designed and implemented a new data aggregation algorithm named Recursive Converging Quartiles (RCQ). The algorithm merges data to eliminate duplicates, filters out invalid readings and summarizes them in a simpler form, which significantly reduces the amount of data to be transmitted to the sink and thereby saves energy [24].
Limitations: [...] capabilities
Title and year: NeuroFuzzy Modeling Scheme for the Prediction of Air Pollution, 2010
Findings: The main objective of this paper is twofold. The first is to develop a fuzzy controller scheme for predicting the change in NO2 or SO2 over urban zones, based on measurements of NO2 or SO2 over defined industrial sources. The second is to develop a neural network (NN) scheme for predicting O3 based on NO2 and SO2 measurements [25].
Limitations: The neural network modelling scheme provides an efficient computational tool for mapping input-output or cause-effect relationships.
Title and year: Artificial neural networks forecasting of PM2.5 pollution using air mass trajectory based geographic model and wavelet transformation, 2015
Findings: This paper presents a novel hybrid model combining air mass trajectory analysis and wavelet transformation to improve the artificial neural network (ANN) forecast accuracy of daily average concentrations of PM2.5 two days in advance [27].
Limitations: Due to the severe air pollution, the overall accuracy of the prediction is not satisfactory.
Title and year: A Machine Learning Approach for Air Quality Prediction: Model Regularization and Optimization, 2018
Findings: This research developed efficient machine learning methods for air pollutant prediction. The authors formulate the problem as regularized multi-task learning (MTL) and use advanced optimization algorithms. They propose refined models to predict the hourly air pollution concentration based on meteorological data of previous days by formulating the prediction over 24 h as a multi-task learning (MTL) problem [28].
Limitations: This approach needs a lot of training data for future prediction.
Findings: [...] standard deviation of the normal distribution underlying the truncated normal distribution of the NO2 observations [29].
Title and year: Sensor-Based Optimization Model for Air Quality Improvement in Home IoT, 2018
Findings: The research covers sensor-based home IoT, studies on indoor air quality, and technical studies on random data generation. It also develops an air quality improvement model that can be readily applied to the market. Accordingly, the authors generate related data based on user behaviour values and integrate the logic into the existing home IoT platform to enable users to easily access the system through the Web or mobile applications [30].
Limitations: There is no global standard of compatibility for the tagging and monitoring equipment. As with every complex system, there are more chances of failure.
Title and year: Air quality forecasting based on cloud model granulation, 2018
Findings: This is the first attempt to predict AQIs by cloud model granulation. The method transforms the problem from the original data space to the feature space and from the feature space to the concept space. Finally, the solution to the original problem is obtained by consistent information reduction [31].
Limitations: The experiments use free open data; if more chemical and meteorological data were obtained, the study might achieve more accurate prediction results. Second, missing data were replaced with the mean or the most frequent value, which may lead to biased results.
generated which should closely relate to the “unusualness” of the specific data
instances. For instance, if using nearest-neighbour-based clustering, an outlier
score can be established as the number of neighbours for a specific 'n' parameter.
The outlier scores that have been suggested in existing methods are very
trivial. A well-defined outlier score can also be used as an indicator for
proactive alerting and may reduce the time lag between occurrence of outlier
and its detection thereby increasing the efficiency of the system. Moreover the
problem of increased false positives upon changes in working conditions can
be simplified to a tuning of threshold outlier scores.
d) The complexity of the process of outlier detection in time series data is
affected to a great extent by the curse of dimensionality. The higher the
dimensionality of the data series, the more time, effort and power needed to
analyse the data so as to be able to find out either the anomalies in a time
series or a subsequence out of a time series database. Current approaches have
the limitations that they are not able to identify the dimensions that effectively
represent the data series and need to be preserved.
e) Outlier detection methods are affected largely by the problem of False
positives. This not only reduces the efficiency of detecting the actual outliers
but also has an adverse effect on the effectiveness and acceptability of the
detection methodologies. The existing methods lack in both sensitivity and
specificity. There are some existing ways to improve precision by using
methods like generalization but these methods have the downside that they
may cause loss of accuracy due to over generalization. An optimal
methodology must increase the accuracy and precision in a way that can be
measured tangibly.
f) Most of the outlier detection methods in use today are not designed to handle
the huge volumes of data that are generated today. In an era of Big-Data, the
currently existing methods of outlier detection are finding themselves
inadequate. We need to develop an approach that is not only effective and
efficient with existing data, but can also be successfully scaled for handling
big data.
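The neighbour-count outlier score mentioned in the gaps above can be illustrated with a minimal sketch. It assumes a fixed search radius; the names `radius` and `min_neighbours` and the toy points are hypothetical and are not taken from this report.

```python
import math


def neighbour_counts(points, radius):
    """For each point, count the other points lying within `radius` of it."""
    return [
        sum(1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius)
        for i, p in enumerate(points)
    ]


# Toy data (made up): three close points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0)]
counts = neighbour_counts(points, radius=0.5)

# Hypothetical threshold: fewer neighbours means a more unusual point.
min_neighbours = 1
outliers = [p for p, c in zip(points, counts) if c < min_neighbours]
```

With a score of this kind, reacting to changed working conditions reduces to adjusting `min_neighbours` or `radius`, which is the threshold tuning the text refers to.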
1.5 Research Objectives
Based on the various research gaps identified in the literature survey phase, this
research strives to overcome at least some of the gaps thereby trying to develop an
effective and efficient methodology for detecting outliers.
a) The proposed research intends to find out the most efficient distance method
which can be used in the application of k-means algorithm.
b) The proposed research also intends to apply an appropriate optimization
methodology so as to improve the performance of the data mining algorithm
that is forming the basis for the outlier detection process.
c) The proposed research also endeavours to identify an efficient method to
distinguish those attributes that are representative of causing the major
problem of air pollution. We would attempt to develop an optimum scalable
methodology that can cope with the large amounts of data being generated
today.
Chapter 2
Clustering and Metaheuristics
Figure 2.3 Collective Anomaly
2.2 Clustering
Clustering is one of the significant data analysis procedures used to
investigate and understand huge collections of data. Clustering has demonstrated
its potential in different fields such as bioinformatics, pattern recognition,
image processing, medical mining and many more. The purpose of clustering is to
organize objects into clusters based on the values of their attributes.
Recently, many researchers have shown considerable interest in developing clustering
algorithms [10]. The difficulty in clustering is that we do not have prior
knowledge about the given dataset. Also, the choice of input parameters such as the
number of clusters, the number of nearest neighbours, the stopping criteria and other factors in
these algorithms makes clustering more challenging. These algorithms
also suffer from unsatisfactory accuracy when the dataset contains
clusters with various complex shapes, densities and anomalies. Furthermore, nature
based algorithms have also been developed to find the exact solution of clustering
problems. These algorithms provide better quality results in comparison to traditional
algorithms [11].
In this algorithm the k centres change their locations step by step until no more
changes occur, in other words until the centres no longer move.
where,
[Figure: K-means flowchart. Recoverable labels: Start; Number of clusters K; Determine the centroids; Determine the distance of objects from centroids; decision "No object move" with branches True (End) and False.]
In the k-means algorithm we first choose the number of clusters. Then we
take the centroid for each cluster randomly from the dataset. After the centroids are chosen,
all the data points are assigned to one of the groups based on the minimum distance.
Then the mean of each cluster is calculated and this mean value is taken as the
updated centroid value. The process continues until we get the same clusters
in two consecutive iterations [13].
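The steps just described can be sketched in a few lines of standard-library Python. This is an illustrative sketch rather than the code used in this report: the function name `kmeans`, the `seed` and `max_iter` parameters and the toy data are assumptions, and the Euclidean distance is used for the assignment step.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Plain k-means with Euclidean distance; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Take k distinct data points at random as the initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Stop when two consecutive iterations give the same clusters.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return centroids, labels


# Toy data (made up): two separated groups of points.
data = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(data, k=2)
```

Because the initial centroids are random, different seeds can give different label numbering; only the grouping is meaningful.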
The covariance matrix used in PCA is symmetric. The resulting dimensions are linearly
independent and they can losslessly represent the data points. The new
dimensions allow us to reconstruct the original dimensions, and the reconstruction error
should be minimized [32].
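As a small illustration of these properties, the sketch below builds the covariance matrix of a toy dataset and extracts its leading principal component. Power iteration is used here only to keep the example free of external libraries; it is an assumption of this sketch, not necessarily the method used in this report, and the function names and data values are made up.

```python
import math


def covariance_matrix(rows):
    """Sample covariance matrix of rows (one observation per row)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
        for a in range(d)
    ]
    return cov, centered


def top_component(cov, iters=200):
    """Leading eigenvector of a symmetric matrix by power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Toy data (made up): two strongly correlated variables.
rows = [(2.0, 1.9), (0.5, 0.7), (3.1, 3.0), (1.0, 1.2), (4.0, 3.8)]
cov, centered = covariance_matrix(rows)
pc1 = top_component(cov)
# Project each centred observation onto the first principal component.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]
```

Keeping only `scores` reduces the two variables to one dimension; the discarded variance is what contributes to the reconstruction error.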
2.4 Metaheuristics
Metaheuristics are algorithms motivated by nature and used for optimization
problems to obtain the best achievable result. A metaheuristic is a general algorithmic
framework that can be applied to various optimization problems with relatively few
changes to adapt it to a particular problem. Examples of metaheuristics
include simulated annealing (SA), tabu search (TS), iterated local search (ILS),
evolutionary algorithms (EA), ant colony optimization (ACO) and particle swarm
optimization (PSO). The objective of metaheuristics is to explore the
search space efficiently so as to find near-optimal solutions [33].
particle is selected. Then it is checked whether the swarm has met the termination
criteria. If not, the positions of the particles are updated and the fitness of all
the particles is calculated and evaluated again. If so, the algorithm
ends. This algorithm is inspired by the behaviour of a swarm searching for food. The
same algorithm can easily be used in computer science. For instance, in our
research we tried to find the major cause of pollution using this algorithm
[14].
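The loop described above can be sketched as a basic global-best PSO. This is a minimal illustration under stated assumptions, not the implementation used in this report: the inertia weight `w`, the acceleration coefficients `c1` and `c2`, the search range and the `sphere` test function are all chosen for the example, and the termination criterion is simply a fixed number of iterations.

```python
import random


def pso(fitness, dim, n_particles=20, iters=100, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Minimise `fitness` with a basic global-best PSO; returns (gbest, score)."""
    rng = random.Random(seed)
    # Generate the first swarm: random positions, zero velocities.
    pos = [[rng.uniform(-5, 5) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    best = min(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[best][:], pbest_score[best]

    for _ in range(iters):  # termination criterion: fixed iteration budget
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity update: inertia + pull to personal and global best.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            score = fitness(pos[i])
            if score < pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score < gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score


def sphere(x):
    """Toy objective with its minimum of 0 at the origin."""
    return sum(v * v for v in x)


gbest, gbest_score = pso(sphere, dim=2)
```

For clustering, `fitness` would instead score a candidate set of centroids, for example by the distance between data points and their cluster heads.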
[Figure: PSO flowchart. Recoverable labels: Start; Generate first swarm; decision "Swarm met the termination criteria?" with branches Yes and No; End.]
Chapter 3
Experiment Analysis
[Figure: Model architecture. Input: Dataset. Stages: Apply K-means; Apply PCA; Apply K-means using the most efficient distance method obtained from step 2; Apply PSO 1; Apply PSO 2. Output: Determine the gas which is most responsible for air pollution.]
3.2 Solution Algorithm
Following are the steps for solution methodology:
Step 4 Apply k-means using the most efficient distance method obtained from step 2
The Air Quality dataset is used as input; it consists of data about the
concentration of the gases present in the air. The data are multidimensional.
The dataset has 13 columns (CO_GT, PT08_S1_CO, NMHC_GT, C6H6_GT,
PT08_S2_NMHC, Nox_GT, PT08_S3_Nox, NO2_GT, PT08_S4_NO2, PT08_S5_O3,
T, RH, AH) and 9358 rows, as shown in Table 3.1.
In the first part, K-means is applied with some modifications. K-means normally uses
the Euclidean distance to measure the separation between 2 data points, and
clusters are formed on that basis. Here, K-means is modified so that, rather
than using the Euclidean distance directly, five different distance formulas are tried in
order to find the most efficient one. The distance formulas used in this research are as
follows:
a) Euclidean
The equation for calculating this distance between 2 points (X, Y) is:
$$d = \sqrt{\sum_{i=1}^{n} (X_i - Y_i)^2}$$
b) Manhattan
This distance method is used if a grid-like path is followed. The equation for
this distance between 2 points (X, Y) is:
$$d = \sum_{i=1}^{n} |X_i - Y_i|$$
where,
n is the number of variables, and Xi and Yi are the values of the ith variable, at
points X and Y respectively [15].
c) Pearson Correlation
It calculates the similarity in shape between 2 profiles. The formula for this
distance is:
$$d = 1 - r$$
where
$$r = \frac{Z(x) \cdot Z(y)}{n}$$
d) Chebychev
The Chebychev distance between 2 points is the maximum difference between the
points in any one dimension. The distance between 2 points (X, Y) is found
using the formula:
$$d = \max_i |X_i - Y_i|$$
where Xi and Yi are the values of the ith variable at points X and Y respectively
[16].
∑
1-
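The Euclidean, Manhattan, Pearson correlation and Chebychev distances named above can be sketched as plain Python functions. The function names are illustrative, and the Pearson variant assumes that Z(x) denotes z-scores computed with the population standard deviation, which the text does not state explicitly.

```python
import math


def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Sum of absolute differences (grid-like path)."""
    return sum(abs(a - b) for a, b in zip(x, y))


def chebyshev(x, y):
    """Largest absolute difference in any one dimension."""
    return max(abs(a - b) for a, b in zip(x, y))


def pearson_distance(x, y):
    """d = 1 - r, with r = Z(x).Z(y)/n (population z-scores assumed)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / n)
    r = sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n
    return 1 - r


# Toy points (made up) to compare the metrics side by side.
p, q = (0.0, 0.0), (3.0, 4.0)
results = {
    "euclidean": euclidean(p, q),
    "manhattan": manhattan(p, q),
    "chebyshev": chebyshev(p, q),
}
```

Any of these functions can replace the distance call in the assignment step of k-means, which is how different metrics can be compared on the same data.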
The efficiency of these distance formulas is measured on the basis of their MSE and
RMSE values, which are the output of this experiment. The
lower the RMSE value, the more efficient the distance formula is. The most efficient
distance formula is then used for the further
implementation of K-means and cluster formation. That is, instead of defaulting to the
Euclidean formula in K-means for distance calculation, we use the distance
formula whose RMSE value is the lowest.
In the second part, the main aim is to optimize the k-means algorithm using PSO;
the distance method used in k-means is the Euclidean distance (as obtained from step 2).
For this, we first dropped the null values from the dataset and normalized it in order to
reduce the redundancy. Then, we applied PCA to reduce the dimensions from 10 to 2
(as shown in Figure 3.2) and to find the correlation between variables. The
variables can now be plotted on x and y axes only, and the clusters can be
formed efficiently.
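The two preprocessing steps mentioned above, dropping null values and normalizing, can be sketched as follows. The sketch assumes that missing values are represented by `None` and that min-max scaling to the range 0 to 1 is an acceptable normalization; the report does not specify either choice, and the sample rows are made up.

```python
def drop_missing(rows):
    """Keep only the rows in which every value is present."""
    return [r for r in rows if all(v is not None for v in r)]


def min_max_normalise(rows):
    """Scale every column to the range [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [
        [(v - l) / (h - l) if h > l else 0.0 for v, l, h in zip(r, lo, hi)]
        for r in rows
    ]


# Toy rows (made up): the second row has a missing first value.
raw = [(2.0, 100.0), (None, 120.0), (4.0, 140.0), (3.0, 120.0)]
clean = drop_missing(raw)
scaled = min_max_normalise(clean)
```

Scaling matters before PCA and distance-based clustering because columns measured on larger scales would otherwise dominate the result.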
Figure 3.2 The value of two dimensions (x,y) after using PCA
In the next step we applied the k-means clustering algorithm with the number of clusters
set to 3. In this algorithm we used a boolean variable init_pp and two functions,
k-means and k-means 2. If init_pp=True, k-means 2 is called and the
centroids of the clusters are selected randomly using the random function; if
init_pp=False, k-means is called and the centroids of the clusters are
selected randomly from the dataset. After finding the centroids we used the fit
function, which assigns the data points to clusters on
the basis of distance and updates the centroid of each cluster. Finally, we
calculated the silhouette score, SSE and quantization error and plotted the clusters.
From the values of the silhouette score, SSE and quantization error we concluded that
k-means 2 works better than k-means.
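The three evaluation measures can be sketched as below. The definitions are common ones and are assumptions of this sketch: SSE is the sum of squared point-to-centroid distances, the quantization error is the average over clusters of the mean point-to-centroid distance, and the silhouette score is the mean of (b - a) / max(a, b) over all points. The report does not state which definitions its code used, and the sample data are made up.

```python
import math


def sse(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))


def quantization_error(points, labels, centroids):
    """Average over clusters of the mean point-to-centroid distance."""
    per_cluster = []
    for j, c in enumerate(centroids):
        members = [p for p, l in zip(points, labels) if l == j]
        if members:
            per_cluster.append(sum(math.dist(p, c) for p in members) / len(members))
    return sum(per_cluster) / len(per_cluster)


def silhouette(points, labels):
    """Mean silhouette coefficient; needs at least two clusters."""
    clusters = sorted(set(labels))
    total = 0.0
    for i, (p, lab) in enumerate(zip(points, labels)):
        own = [q for k, (q, l) in enumerate(zip(points, labels))
               if l == lab and k != i]
        if not own:
            continue  # a singleton cluster contributes a score of 0
        a = sum(math.dist(p, q) for q in own) / len(own)
        b = min(
            sum(math.dist(p, q) for q, l in zip(points, labels) if l == other)
            / labels.count(other)
            for other in clusters if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


# Toy clustering (made up): two pairs of points and their centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
labels = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 1.0)]
scores = (silhouette(points, labels),
          sse(points, labels, centroids),
          quantization_error(points, labels, centroids))
```

Under these definitions a lower SSE and quantization error indicate tighter clusters, while the silhouette coefficient lies between -1 and 1.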
Next we applied the PSO algorithm using two different approaches, named
PSO 1 and PSO 2. In this algorithm we used a
boolean variable use_kmeans. If use_kmeans is True, the centroids
obtained from k-means are replaced by the centroids obtained from the PSO 1
algorithm, i.e. the end condition of k-means becomes the start condition of PSO 1. If
use_kmeans is False, the centroids obtained from k-means are
optimized using the PSO 2 algorithm, i.e. PSO 2 optimizes the centroids obtained at
each step. Finally, the two approaches are compared on the basis of their gbest score,
which represents the distance between data points and cluster heads. The lower the
gbest score, the better the method of optimization [22].
Now to predict the major gas responsible for air pollution we select any cluster and
find out the gases present in that cluster. After that we find the mean of all the gases
individually and plot the bar graph showing the mean concentration of each gas.
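This last step can be sketched as a per-column mean over the members of one cluster. The column names come from the dataset description, but the values, labels and the helper name `mean_per_column` are made up for illustration, and the sketch assumes the columns have already been brought to a comparable scale.

```python
def mean_per_column(rows, labels, cluster, columns):
    """Mean of every column over the rows assigned to `cluster`."""
    members = [r for r, l in zip(rows, labels) if l == cluster]
    return {
        name: sum(r[i] for r in members) / len(members)
        for i, name in enumerate(columns)
    }


# Illustrative values only; these are not readings from the dataset.
columns = ["CO_GT", "NO2_GT", "PT08_S5_O3"]
rows = [(2.0, 100.0, 1000.0), (4.0, 120.0, 1200.0), (1.0, 60.0, 500.0)]
labels = [0, 0, 1]

means = mean_per_column(rows, labels, cluster=0, columns=columns)
# The column with the highest mean is the one the bar graph would highlight.
top = max(means, key=means.get)
```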
Chapter 4
Result and Discussion
The result of step 2 of the proposed solution algorithm is that the RMSE value of the
Euclidean distance is the lowest, as can be seen from the outputs below.
Consequently, the Euclidean distance is used in K-means for cluster formation in the
further implementation. Figures 4.2 to 4.5 show the RMSE and MSE values of
the various distance methods used.
The result of using PCA is the correlation matrix shown in Figure 4.6.
Figure 4.6 Correlation between gases
The correlation scale is such that light colours represent higher correlation
between two variables, whereas dark colours represent lower correlation.
A high correlation value means that the two gases tend to be
present together.
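A correlation matrix of this kind can be computed with the Pearson coefficient for every pair of columns. The sketch below uses placeholder column names and made-up values; it illustrates the calculation only and does not reproduce Figure 4.6.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """Nested dict holding the correlation of every pair of columns."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Placeholder columns (made up): gas_b rises with gas_a, gas_c falls.
columns = {
    "gas_a": [1.0, 2.0, 3.0, 4.0],
    "gas_b": [2.0, 4.0, 6.0, 8.0],
    "gas_c": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(columns)
```

Values near 1 indicate that two columns rise together, values near -1 that one falls as the other rises, and values near 0 that there is no linear relationship.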
Further, on applying the k-means algorithm we get the clusters shown in Figure
4.7. The black points in the image show the cluster heads. Here, the centroids of the
clusters are chosen randomly from the dataset. The data points that do not lie near
any cluster are the anomalies in the dataset.
Next, on applying the k-means 2 algorithm we get the clusters shown in
Figure 4.8. The blackish dots in the image show the cluster heads. Here, the
centroids of the clusters are selected randomly using the random function. The
data points that do not lie near any cluster are the anomalies in the dataset.
The above two approaches can be compared using the silhouette score, SSE and quantization
error. From Figure 4.9 and Figure 4.10 it can be concluded that k-means 2
is better than k-means, since the lower the SSE value, the better the method.
Figure 4.9 Silhouette, SSE, Quantization Error values of k-means
Figure 4.10 Silhouette, SSE, Quantization Error values of k-means 2
In PSO 1 the centroids obtained from k-means are replaced by the
centroids obtained from the PSO 1 algorithm; the clusters formed by PSO 1 are
shown in Figure 4.11. In PSO 2 the centroids obtained from k-means are
optimized using the PSO 2 algorithm; the clusters formed by PSO 2 are
shown in Figure 4.12.
The two PSO approaches can be compared using the gbest score, which represents
the distance between data points and cluster heads. The lower the gbest
score, the better the method of optimization.
Figure 4.11 Cluster Formation using PSO 1
Figure 4.14 Gbest score of PSO 2
From Figure 4.13 and Figure 4.14 it is concluded that PSO 1 is better than PSO 2 in
our case, as the gbest score of PSO 1 is lower than that of PSO 2.
The bar graph below shows the concentration levels of the various gases present
in the air. From this graph we conclude that PT08_S5_O3 is the major
cause of air pollution, as it is present in the atmosphere in the highest concentration,
thus degrading the quality of the air.
Chapter 5
Conclusion
The aim of the proposed research is to find the gas that is most responsible for air
pollution. We began by implementing the k-means algorithm
using five different distance methods and compared them on the basis
of their RMSE values: the lower the RMSE value, the better the distance method.
From this we concluded that the Euclidean distance is the best, as its RMSE value is
the lowest of all. We then used PCA to reduce the dimensions of our dataset from
10 to 2. This reduced dataset was used to implement the k-means and PSO
algorithms. We used two approaches to implement k-means. In the
first approach, k-means, the centroids are taken randomly from the dataset, whereas
in the second approach, k-means 2, the centroids are taken randomly using the
random function. Using k-means and k-means 2 we plotted graphs showing the
cluster formation, which helps to determine the outliers. The two approaches
were compared on the basis of their silhouette score, SSE and quantization error values,
and we concluded that k-means 2 is better than k-means because the values
of the silhouette score, SSE and quantization error are lower for k-means 2. We further
optimized our results using the PSO algorithm, which was also implemented
using two different approaches, PSO 1 and PSO 2. In PSO 1, the centroids
obtained from k-means are replaced by the centroids obtained from the PSO 1
algorithm, whereas in PSO 2 the centroids obtained from k-means are optimized
using the PSO 2 algorithm. The two PSO approaches were compared on the basis of
their gbest score: the lower the gbest score, the better the optimization
algorithm. Here the gbest score of PSO 1 is lower than that of PSO 2, so in
our case PSO 1 is better than PSO 2.
Using the above solution methodology, PT08_S5_O3 is found to be most
responsible for air pollution, as the concentration of this gas is the highest among
all the gases. This fulfils the objective of this research. Now that we know the
cause of air pollution, we can take appropriate action to minimize
environmental pollution, thus improving air quality and human health.
References
[1] Air Quality Prediction by Machine Learning Method by Huiping Peng B.Sc., Sun
Yat-sen University, 2013
[2] 2017_5$thumbimg112_May_2017_070949977.pdf
[3] Particle Swarm Optimization, https://en.wikipedia.org/wiki/Particleswarmoptimizatin,
accessed on 9 May 2019 at 5:30 PM
[4] Saptarshi Sengupta *, Sanchita Basak and Richard Alan Peters II, “Particle Swarm
Optimization: A survey of historical and recent developments with hybridization
perspectives”
[5] Principal Component Analysis, https://en.wikipedia.org/wiki/Principal_component_analysis,
accessed on 9 May at 6 PM
[12] k-means clustering algorithm,
https://sites.google.com/site/dataclusteringalgorithms/k-means-clustering-algorithm,
accessed on 10 May 2019 at 12:30 PM
[13] Understanding K-means Clustering in Machine Learning,
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1,
accessed on 10 May 2019 at 2 PM
[14] Implementing the Particle Swarm Optimization (PSO) Algorithm in Python,
https://medium.com/analytics-vidhya/implementing-particle-swarm-optimization-pso-algorithm-in-python-9efc2eb179a6,
accessed on 10 May 2019 at 5 PM
[15] Methods for measuring distances,
http://www.sthda.com/english/wiki/print.php?id=235, accessed on 10 May 2019 at
5:25 PM
[16] Distance Metrics Overview,
http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Distance_Metrics_Overview.htm,
accessed on 10 May 2019 at 5:30 PM
[18] Francisc Popescu and Ioana Ionel, “Anthropogenic Air Pollution Sources”,
August 18th 2010
[23] D. Deniz Genc, Canan Yesilyurt, Gurdal Tuncel, "Air pollution
forecasting in Ankara, Turkey using air pollution index and its relation to assimilative
capacity of the atmosphere, 2009", Environmental Monitoring and Assessment, July
2010, Volume 166, Issue 1–4, pp 11–27.
[25] Tharwat E. Alhanafy, Fareed Zaghlool and Abdou Saad El Din Moustafa
(Computer and System Engineering Department, Al-Azhar University, Cairo, Egypt;
Arab Co. for Engineering & Systems Consultations (AEC)),
"Neuro Fuzzy Modeling Scheme for the Prediction of Air
Pollution, 2010".
[26] Srinivas Devarakonda, Parveen Sevusu, Hongzhang Liu, Ruilin Liu, Liviu Iftode,
Badri Nath, Department of Computer Science, Rutgers University, Piscataway, NJ
08854-8091, "Real-time Air Quality Monitoring Through Mobile Sensing in
Metropolitan Areas, 2013".
[27] Xiao Feng, Qi Li, Yajie Zhu, Junxiong Hou, Lingyan Jin, Jingjie Wang, Institute
of Remote Sensing and GIS, Peking University, Beijing 100871, China, "Artificial
neural networks forecasting of PM2.5 pollution using air mass trajectory based
geographic model and wavelet transformation, 2015".
[28] Dixian Zhu, Changjie Cai, Tianbao Yang and Xun Zhou, "A Machine
Learning Approach for Air Quality Prediction: Model Regularization and
Optimization, 2018".
[29] V. M. van Zoest, A. Stein and G. Hoek, "Outlier Detection in Urban Air Quality
Sensor Networks, 2018".
[30] Jonghyuk Kim and Hyunwoo Hwangbo, "Sensor-Based Optimization Model for
Air Quality Improvement in Home IoT, 2018".
[31] Yi Lin, Long Zhao, Haiyan Li and Yu Sun, "Air quality forecasting based on
cloud model granulation, 2018".
Appendix A
Planning of Work
[Figure: Bar chart of the work plan. Vertical axis: Months. Phases: Literature Survey; Determining Objectives; Implementation; Paper Writing.]
Appendix B
Code
(Available in CD with name pollution_code.zip)