Research Article
INTRODUCTION
Rainfall plays an important role in the agro-economy of Bangladesh, which is located in the tropical zone. Its climate is characterized by large variations in seasonal rainfall, moderately warm temperatures and high humidity. Due to its geographic location and dense population, Bangladesh has been identified as one of the countries most vulnerable to climate change (Islam, 2009). This investigation uses monthly records of an important climatic variable, rainfall, observed at 34 ground-based stations of the Bangladesh Meteorological Department (BMD) distributed over the country during the period 1991-2013 (http://www.data.gov.bd/). Based on the combined trend of rainfall and maximum temperature intensity (determined by GIS mapping), Bangladesh is geographically divided into four regions: the North-Eastern, South-Eastern, South-Western and North-Western Regions. Other research has studied and analyzed the information from each station while grouping the stations into one of the eight hydrological (planning) regions of Bangladesh: North East (NE), North Central (NC), North West (NW), South East (SE), South Central (SC), South West (SW), Eastern Hill (EH) and River and Estuary (RE). These regions are defined in qualitative terms, not quantitatively. This zone classification has been used not only for differences in climate but also for social and economic variables.

Many climatic studies have used a variety of data to define climatic types and delineate zones of similar climate. Several methods have also been applied to detect homogeneous climate zones. In this study, cluster analysis is used. Cluster analysis applied to meteorological variables is a suitable approach for identifying climate zones, and its use is becoming increasingly common in atmospheric research (Erin, 1984; Kalkstein et al. 1987; Tayan et al. 1998). Choosing appropriate data to cluster is an initial consideration in cluster analysis. In climate classification, the variability of long-term rainfall is one of the most readily available variables (Linacre, 1992). In this study we intend to define spatially homogeneous climate regions of Bangladesh using the mathematical methodology of cluster analysis.

*Corresponding Author: Md. Siraj-Ud-Doulah, Department of Statistics, Begum Rokeya University, Rangpur, Bangladesh. Email: sdoulah_brur@yahoo.com
The investigation has been carried out using daily records of one important climatic variable, rainfall, observed at 34 ground-based stations of the Bangladesh Meteorological Department (BMD) distributed over the country during the period 1991-2013 (http://www.data.gov.bd/). Although BMD has thirty-six (36) ground-based stations, data from only thirty-four (34) stations are used in this research. At the initial stage, the quality of the rainfall data is checked against the following criteria (Erin, 1984; Masoodian, 2005):

(i) Non-existence of dates
(ii) Negative monthly rainfall
(iii) Monthly winter rainfall > 100 mm
(iv) Weather stations with > 35% missing data
(v) Stations with gaps of three or more years within the series

If any of the above criteria holds for a dataset, it is identified as erroneous data. Accordingly, two BMD stations are discarded after applying these conditions to the data period from 1991 to 2013. An R-based program is used to detect homogeneous climate zones.

For clustering purposes there are two widely used families of methods: hierarchical and non-hierarchical (partitional). Hierarchical clustering is categorized as divisive when a large data set is divided into several small groups, and as agglomerative when small groups are put together to create larger clusters (Dyeret, 1975; Gan et al. 2007; Sarah et al. 2011). Because so many descriptive statistics are available in the literature (Doulah, 2018) for evaluating data, we first applied the most frequently used measures in our analysis and then used the clustering techniques.

Agglomerative Algorithms

Some of the agglomerative algorithms are: single linkage, complete linkage, average linkage, centroid and Ward’s method. Several proximity measures are used, such as Euclidean distance, Minkowski distance, Manhattan distance, maximum distance, correlation-based distance and Canberra distance. The partitional clustering process is based on recovering the natural grouping present in the data through a single partition. The partitional algorithms are divided into K-means, fuzzy and model-based clustering techniques (Hossen et al. 2015; Han & Kamber, 2006; Johnson & Wichern, 1998).

Non-hierarchical Algorithms

K-means clustering

K-means clustering intends to partition n objects into k clusters in which each object belongs to the cluster with the nearest mean.
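As a minimal sketch of this partitioning idea, the following standard-library Python code alternates between assigning each object to its nearest center and recomputing the centers. It is an illustration only, not the R program used in this study; the function name `kmeans`, the fixed random seed and the toy station values are assumptions made for the example.

```python
import math
import random


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Pick k distinct observations as the initial cluster centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center (Euclidean).
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the iteration has converged
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centers


# Hypothetical toy data: (mean monthly rainfall in mm, coefficient of variation).
stations = [(150.0, 0.9), (160.0, 1.0), (155.0, 0.95),
            (320.0, 1.4), (310.0, 1.5), (330.0, 1.45)]
labels, centers = kmeans(stations, k=2)
print(labels)
print(centers)
```

Because the variables here are on very different scales, the rainfall column dominates the Euclidean distance; in practice one would probably standardize the variables first, which this sketch leaves out for brevity.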
Defining Homogenous Climate zones of Bangladesh using Cluster Analysis
Int. J. Stat. Math. 121
Table 2: Some of the distance measures

Distance       Statistic
Euclidean      $d(x, y) = \sqrt{\sum_{i=1}^{p} (x_i - y_i)^2}$
Manhattan      $d(x, y) = \sum_{i=1}^{p} |x_i - y_i|$
Minkowski      $d(x, y) = \left( \sum_{i=1}^{p} |x_i - y_i|^m \right)^{1/m}$
Maximum        $d(x, y) = \max_{1 \le i \le p} |x_i - y_i|$
Correlation    $d_{cor}(x, y) = 1 - \frac{\sum_{i=1}^{p} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{p} (x_i - \bar{x})^2 \sum_{i=1}^{p} (y_i - \bar{y})^2}}$
Canberra       $d(x, y) = \sum_{i=1}^{p} \frac{|x_i - y_i|}{|x_i| + |y_i|}$

Algorithm
1. Cluster the data into k groups, where k is predefined.
2. Select k points at random as cluster centers.
3. Assign objects to their closest cluster center according to the Euclidean distance function.
4. Calculate the centroid or mean of all objects in each cluster.
5. Repeat steps 3 and 4, using the centroids as the new cluster centers, until the same points are assigned to each cluster in consecutive rounds.

Fuzzy clustering

Fuzzy c-means (FCM) clustering is a clustering algorithm developed by Dunn and later improved by Bezdek (Luxburg, 2010). It is useful when the required number of clusters is pre-determined; the algorithm then tries to put each of the data points into one of the clusters. What makes FCM different is that it does not decide the absolute membership of a data point in a given cluster; instead, it calculates the likelihood (the degree of membership) that a data point will belong to that cluster. Hence, depending on the accuracy of the clustering that is required in practice, appropriate tolerance measures can be put in place. Since the absolute membership is not calculated, FCM can be extremely fast, because the number of iterations required to complete a specific clustering exercise corresponds to the required accuracy.

Model-Based clustering

The model-based clustering framework consists of three major steps (Baldwin & Lakshmivarahan, 2002; Everitt, 1993):
(a) Initialize the EM algorithm using the partitions from model-based agglomerative hierarchical clustering;
(b) Estimate the parameters using the EM algorithm;
(c) Choose the model and the number of clusters according to the BIC.

In this method, a model is hypothesized for each cluster in order to find the best fit of the data to the given model. The method also locates the clusters by clustering the density function, so it reflects the spatial distribution of the data points. It further provides a way to determine the number of clusters based on standard statistics, taking outliers or noise into account. It therefore yields robust clustering methods.

Validity Indices

In the literature on data clustering, many clustering algorithms have been proposed for different applications and different sizes of data. But clustering a dataset is an unsupervised process; there are no predefined classes and no examples that can show that the clusters found by the clustering algorithms are valid (Hardy, 1996; Luxburg, 2010). To compare the clustering results of different clustering algorithms, it is necessary to develop some validity criteria. Also, if the number of clusters is not given to the clustering algorithm, finding the optimal number of clusters in the data set is a highly nontrivial task. To do this, we need some cluster validity methods. The notation and meaning of the symbols used in the validity indices are:

$n$ = number of observations,
$p$ = number of variables,
$q$ = number of clusters,
$X = \{x_{ij}\}$, $i = 1, 2, \ldots, n$; $j = 1, 2, \ldots, p$ = $n \times p$ data matrix of $p$ variables measured on $n$ independent observations,
$\bar{x}$ = centroid of the data matrix $X$,
$n_k$ = number of objects in cluster $C_k$,
$c_k$ = centroid of cluster $C_k$,
$x_i$ = $p$-dimensional vector of observations of the $i$th object in cluster $C_k$,
$W_q = \sum_{k=1}^{q} \sum_{i \in C_k} (x_i - c_k)(x_i - c_k)^T$ is the within-group dispersion matrix for data clustered into $q$ clusters,
$B_q = \sum_{k=1}^{q} n_k (c_k - \bar{x})(c_k - \bar{x})^T$ is the between-group dispersion matrix for data clustered into $q$ clusters,
$T$ = total sum of squares,
$S^2 = \frac{1}{p} \sum_{j=1}^{p} \frac{BGSS_j}{TSS_j}$, where $BGSS_j = \sum_{k=1}^{q} n_k (c_{kj} - \bar{x}_j)^2$ and $TSS_j = \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2$.

The following cluster validity methods are given in Table 3 below:
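To make the validity-index notation ($W_q$, $B_q$, $BGSS_j$, $TSS_j$, $S^2$) concrete, the sketch below computes the traces of the within-group and between-group dispersion matrices and the averaged ratio $S^2$ for a small labelled data set. It is a supplementary illustration under stated assumptions, not one of the indices of Table 3: the function name `dispersion_summary` and the toy data are hypothetical, and every variable is assumed to have a non-zero total sum of squares.

```python
def dispersion_summary(X, labels):
    """Return (trace of W_q, trace of B_q, S^2, total sum of squares)."""
    n, p = len(X), len(X[0])
    grand = [sum(row[j] for row in X) / n for j in range(p)]  # overall centroid

    # Group the observations by cluster label.
    clusters = {}
    for row, lab in zip(X, labels):
        clusters.setdefault(lab, []).append(row)

    wgss = [0.0] * p  # per-variable within-group sum of squares
    bgss = [0.0] * p  # per-variable between-group sum of squares (BGSS_j)
    for members in clusters.values():
        n_k = len(members)
        c_k = [sum(r[j] for r in members) / n_k for j in range(p)]  # centroid
        for j in range(p):
            wgss[j] += sum((r[j] - c_k[j]) ** 2 for r in members)
            bgss[j] += n_k * (c_k[j] - grand[j]) ** 2

    # TSS_j: total sum of squares of variable j around the overall centroid.
    tss = [sum((row[j] - grand[j]) ** 2 for row in X) for j in range(p)]
    s2 = sum(b / t for b, t in zip(bgss, tss)) / p
    # The diagonal sums equal the traces of W_q and B_q.
    return sum(wgss), sum(bgss), s2, sum(tss)


# Hypothetical toy data: four observations, two variables, two clusters.
X = [(1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
tr_w, tr_b, s2, total = dispersion_summary(X, [0, 0, 1, 1])
print(tr_w, tr_b, s2, total)
```

For any partition the per-variable identity $TSS_j = WGSS_j + BGSS_j$ holds, so a value of $S^2$ close to 1 indicates that most of the variation lies between clusters rather than within them.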
Settling the number of clusters is a difficult task, since there is no specific method for this purpose; the number is obtained by trying candidate cluster assignments until the optimal value is found. Some of the indices used for establishing the number of clusters can also be employed to validate cluster quality.

RESULTS AND DISCUSSIONS

The statistical analysis of the monthly rainfall data of the 34 meteorological stations is summarized in Table 4, where the mean, standard deviation (SD), coefficient of variation (CV), skewness (S) and kurtosis (K) are given.
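The descriptive measures named above can be computed per station along the following lines. This is a hedged sketch: the paper does not state which estimators were used, so the sample standard deviation (n - 1 denominator), the moment-based skewness and the excess kurtosis are assumptions, as are the function name `describe` and the example rainfall values.

```python
import statistics


def describe(values):
    """Mean, SD, CV (%), skewness and excess kurtosis of a numeric sequence."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 denominator (assumed)
    # Central moments with an n denominator, used for the shape measures.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return {
        "mean": mean,
        "sd": sd,
        "cv": 100.0 * sd / mean,          # coefficient of variation in percent
        "skewness": m3 / m2 ** 1.5,       # 0 for a symmetric sample
        "kurtosis": m4 / m2 ** 2 - 3.0,   # excess kurtosis, 0 for a normal law
    }


# Hypothetical monthly rainfall totals (mm) for one station, January to December.
monthly_rainfall = [8.0, 20.0, 45.0, 110.0, 250.0, 420.0,
                    480.0, 390.0, 300.0, 160.0, 30.0, 10.0]
summary = describe(monthly_rainfall)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A large CV together with positive skewness would be consistent with the strongly seasonal rainfall described in the introduction, although the numbers above are illustrative only.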
Chuadanga, Rajshahi and Ishurdi were the stations least affected by monthly rainfall.
The dendrograms of several linkage methods, based on different distance measures, for the monthly rainfall data of the 34 meteorological stations are presented below.
[Figures: dendrograms for the Single Linkage, Complete Linkage, Average Linkage, Ward.D and Centroid methods]
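The merging process that such dendrograms visualize can be sketched with a naive agglomerative procedure. The code below is an illustration under stated assumptions, not the R routine used to produce the figures: it supports only single, complete and average linkage with Euclidean distance (Ward and centroid linkage are omitted), and the function name `agglomerate` and the toy station values are hypothetical.

```python
import math
from itertools import combinations


def agglomerate(points, n_clusters, linkage="average"):
    """Naive agglomerative clustering; returns clusters as lists of indices."""
    clusters = [[i] for i in range(len(points))]  # start from singletons

    def cluster_distance(a, b):
        # All pairwise Euclidean distances between members of a and b.
        d = [math.dist(points[i], points[j]) for i in a for j in b]
        if linkage == "single":
            return min(d)
        if linkage == "complete":
            return max(d)
        if linkage == "average":
            return sum(d) / len(d)
        raise ValueError(f"unsupported linkage: {linkage}")

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters under the chosen linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge cluster b into a
        del clusters[b]                          # safe because a < b
    return [sorted(c) for c in clusters]


# Hypothetical toy data: (mean monthly rainfall in mm, coefficient of variation).
stations = [(150.0, 0.9), (160.0, 1.0), (155.0, 0.95),
            (320.0, 1.4), (310.0, 1.5), (330.0, 1.45)]
for method in ("single", "complete", "average"):
    print(method, agglomerate(stations, 2, linkage=method))
```

Recomputing all pairwise distances at every merge is simple but slow; for 34 stations that cost is negligible, whereas larger data sets would call for a library implementation.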
Here we checked the validity of the clusters of the climate variable, rainfall, using nine well-recognized validity indices. From Table 5 and Figure 9 we found that there are seven clusters in our dataset. Therefore, we conclude that there are seven homogeneous climate zones in Bangladesh.

Fuzzy clustering methods are the best among all the methods considered, and the analysis of the rainfall data indicates that Bangladesh has seven homogeneous climate zones.

ACKNOWLEDGEMENTS
Gong X, Richman MB. (1995). On the application of cluster analysis to growing season precipitation data in North America east of the Rockies, Journal of Climate, 8:897-931.
Gan G, Ma C, Wu J. (2007). Data Clustering Theory, Algorithms, and Applications. ASA, Alexandria.
Hardy A. (1996). On the number of clusters. Computational Statistics and Data Analysis, 23:83-96.
Han J, Kamber M. (2006). Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, San Francisco.
Hossen MB, Doulah MSU. (2016). Identification of Robust Clustering Methods in Gene Expression Data Analysis, Current Bioinformatics, 12:558-562.
Hossen MB, Doulah MSU, Hoque A. (2015). Methods for Evaluating Agglomerative Hierarchical Clustering for Gene Expression Data: A Comparative Study. Computational Biology and Bioinformatics, 4(6):88-94.
http://www.data.gov.bd/
Islam MN. (2009). Rainfall and Temperature Scenario for Bangladesh, The Open Atmospheric Science Journal, 3:93-103.
Johnson R, Wichern D. (1998). Applied Multivariate Statistical Analysis. Englewood Cliffs, NJ: Prentice-Hall.
Kalkstein LS, Tan GR, Skindlov JA. (1987). An evaluation of 3 clustering procedures for use in synoptic climatological classification, Journal of Climate and Applied Meteorology, 26:717-730.
Linacre E. (1992). Climate Data and Resources: A Reference and Guide. Routledge, London and New York.
Luxburg U. (2010). Clustering stability: An overview, Found. Trends Mach. Learn. 2(3):235-274.
Meila M. (2007). Comparing clusterings: an information based distance, Journal of Multivariate Analysis, 98(5):873-895.
Masoodian SA. (2005). Regionalization of Precipitation Regimes of Iran Using Cluster Analysis, Journal of Research in Geography, 52:47-61.
Nathan RJ, McMahon TA. (1990). Identification of homogeneous regions for the purpose of regionalization. J. Hydrol. 121:217-238.
Sarah N, Alaa K, El-Halees M. (2011). Implementation of Data Mining Techniques for Meteorological Data Analysis, IJICT Journal, 1(3):25-37.
Yashwant S, Sananse SL. (2015). Comparisons of Different Methods of Cluster Analysis with application to Rainfall Data, IJIRSET Journal, 4(11):49-64.
Tayan M, Dalfes N, Karaca M, Yenigun O. (1998). A comparative assessment of different methods for detecting inhomogeneities in Turkish temperature data set, International Journal of Climatology, 18:561-578.

Accepted 2 January 2019

Citation: Doulah S, Islam N (2019). Defining Homogenous Climate zones of Bangladesh using Cluster Analysis. International Journal of Statistics and Mathematics, 6(1): 119-129.

Copyright: © 2019 Doulah and Islam. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.