An Efficient CBIR Approach for Diagnosing the Stages of Breast Cancer Using KNN Classifier

Jini.R. Marsilin and Dr.G. Wiselin Jiji

Bonfring International Journal of Advances in Image Processing, Vol. 2, No. 1, March 2012
ISSN 2277-503X | © 2012 Bonfring
Abstract--- This paper proposes a mammogram image retrieval technique based on a pattern similarity scheme. Doctors compare previous and current mammogram images associated with pathologic conditions to diagnose the real stage of breast cancer. Lack of awareness and of screening programs contributes to breast cancer deaths, and early detection is the best way to reduce the deaths per incident ratio. Mammography is the best of the techniques currently used for diagnosing breast cancer. In this paper, the retrieval process is divided into four distinct parts: feature extraction, kNN classification, pattern instantiation, and computation of pattern similarity. In the feature extraction step, low-level texture features such as entropy, homogeneity, contrast, energy, correlation, and run length matrix features are extracted. These features are classified using a K-Nearest Neighbor classifier to differentiate normal tissue from abnormal tissue. Each resulting group is treated as a pattern. Finally, pattern similarity is estimated to retrieve images based on their similarity to the query image. The scheme can be applied in Content Based Image Retrieval systems to retrieve images from large databases and identify the real stage of breast cancer. If cancer is found in its early stages, it can be cured.
Keywords--- Cancer Stage, Content Based Image Retrieval
(CBIR), KNN Classifier, Pattern
I. INTRODUCTION
The increasing trend of cancer-related deaths has forced humanity to work harder on cancer detection and treatment. Cancer is a leading health problem in India, with approximately 1 million cases occurring each year. Breast cancer is one of the most common cancers and the second most frequent cause of cancer-related deaths among women [1]. It is a malignant tumor that develops from the ductal and lobular cells of the breast. A malignant tumor is a group of cancer cells that affects surrounding tissues and can also spread to other parts of the body. The main risk factors for breast cancer are later age at childbirth, fewer children, shorter duration of breastfeeding, fear of self-examination, fear of chemotherapy, and a substantial increase in the consumption of fatty foods. Women aged 40-69 years have a higher risk of breast cancer [9].

Jini.R. Marsilin, Department of Computer Science & Engineering, Dr.Sivanthi Aditanar College of Engineering, Tiruchendur, India. E-mail: jinirmarsilin@gmail.com
Dr.G. Wiselin Jiji, Professor & HOD, Department of Computer Science & Engineering, Dr.Sivanthi Aditanar College of Engineering, Tiruchendur, India. E-mail: jijivevin@yahoo.co.in

The World Health Organization reports that more than 1.2 million people are diagnosed with breast cancer worldwide every year [16]. Over the past 20 years, breast cancer death rates have remained steady even though the number of new cases has grown, because of earlier detection and better treatments [2]. Breast cancer is the most frequently diagnosed cancer among women in the US and other developed countries. The deaths per incident ratio is higher in India, at almost 50%, compared with 30% in China and 18% in the US [17]. This implies that breast cancer is not detected early enough in India, which could be due to a lack of screening programs and the lack of a culture of frequent self-examination or breast awareness. Therefore, even with the best treatments, the breast cancer death rate is high in India.
Breast cancer can be diagnosed at an early stage with the help of a mammogram, which is a low-dose x-ray image of the breast. Early detection is needed to cure breast cancer. The early detection technique used in [3] detects the tumor from mammogram images.
The proposed scheme aims to reduce mortality among women due to breast cancer by identifying the tumor at its initial stage using content based image retrieval (CBIR), so that treatment can begin at the appropriate time. There is increasing interest in using CBIR techniques to diagnose the stage of breast cancer by identifying similar past cases. CBIR is the process of retrieving desired images from a large collection based on features (such as color, texture and shape) that are automatically extracted from the images themselves. SIMPLIcity [4] and FIRE [5] are widely used CBIR systems.
Hierarchical clustering and K-Means clustering [6] speed up retrieval and return better-matched images. The CBIR approach used in [8] is not efficient enough for large-scale retrieval of radiographic images. The similarity learning approach to content based image retrieval applied to digital mammography [10] requires prior knowledge about the dataset. These approaches also impose constraints on the semantics required for the image retrieval task.
The main purpose of the proposed scheme is to develop a dedicated CBIR system that predicts the real stage of breast cancer by comparing the query image with the database images, and thereby to reduce the death rate.
II. BREAST CANCER STAGING
Tumor size, lymph node involvement, tumor grade, and
whether the cancer has spread to other parts of the body
beyond the breast are used to categorize the stage of breast cancer. Staging information helps the doctor understand the disease and make decisions about treatment.
Stage 0 describes noninvasive breast cancers: there is no evidence of cancer cells breaking out of the part of the breast in which they started. Stage I describes invasive breast cancer in which the tumor measures up to 2 cm and no lymph nodes are involved. In Stage II (A or B), the tumor measures 2 to 5 cm and the cancer may or may not have spread to the lymph nodes; this is also invasive breast cancer. Stage III (A, B, or C) is the advanced stage: the cancer may be any size and has spread to lymph nodes within the breast itself and to the chest wall and/or the skin of the breast. In Stage IV, the cancer has spread to the lymph nodes and also to other parts of the body, most often the bones, lungs, liver, or brain.
Stages I, IIA, IIB, and IIIA are considered "early-stage" breast cancer. In a study of 117 breast cancer patients, 51.3% of cases were in clinical stage II, 21.4% in clinical stage III and 11.1% in clinical stage IV [7]. This reinforces the need for improved screening techniques and for greater awareness among women of the potential risk of breast cancer, so that it can be detected early.
The proposed scheme supports the development of content based image retrieval systems capable of retrieving images based on their similarity to the query image and identifying the correct stage of breast cancer.
III. METHODOLOGY
We propose a CBIR approach that uses a pattern similarity scheme to diagnose the real stage of breast cancer from mammographic images. The retrieval process is illustrated in the flowchart shown in Figure 1.
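Figure 1: Block Diagram of Proposed Image Retrieval System (blocks: Query mammogram image, Feature Extraction, KNN Classification, Pattern Instantiation, Pattern Similarity, Pattern Base, Medical Image database, Retrieved Image)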
Low-level features are extracted from the mammogram image. These features are then classified using the k Nearest Neighbor (KNN) classifier, and each resulting group is treated as a pattern. The distances between the structure components and between the measure components of two patterns are computed, and the similarity between the patterns is estimated from both distances. Using this similarity measure, the images most similar to the query image are retrieved. The retrieved images are used to identify the real stage of breast cancer.
A. Pattern Base (PB)
The pattern base stores the pattern information extracted from the images. It consists of three basic layers. A pattern type describes the pattern structure. A pattern is an instance of the corresponding pattern type. A class is a collection of patterns of the same pattern type.
A pattern type PT is defined as a pair PT = {SS, MS}, and a pattern as p = {s, m}, where SS is the structure schema, which defines the pattern space, and MS is the measure schema, which describes the quality of the source data representation. A pattern type PT is called complex if its structure schema SS includes another pattern type; otherwise PT is called simple [11].
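The three layers of the pattern base can be sketched as plain data structures. The following is a minimal illustration, not the authors' implementation: the class names, the dictionary-based schemas, and the `image_type` example are assumptions made here for clarity.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class PatternType:
    """Pattern type PT = (SS, MS): describes the structure of a pattern."""
    name: str
    structure_schema: Dict[str, Any]  # SS: field name -> type or nested PatternType
    measure_schema: Dict[str, Any]    # MS: measure name -> type

    def is_complex(self) -> bool:
        # Complex if the structure schema includes another pattern type.
        return any(isinstance(t, PatternType) for t in self.structure_schema.values())


@dataclass
class Pattern:
    """Pattern p = (s, m): an instance of a pattern type."""
    pattern_type: PatternType
    structure: Dict[str, Any]
    measure: Dict[str, Any]


@dataclass
class PatternClass:
    """Class: a collection of patterns of the same pattern type."""
    pattern_type: PatternType
    patterns: List[Pattern] = field(default_factory=list)

    def add(self, pattern: Pattern) -> None:
        if pattern.pattern_type is not self.pattern_type:
            raise ValueError("pattern type mismatch")
        self.patterns.append(pattern)


# Hypothetical pattern types used only for illustration.
specimen_type = PatternType("Specimen", {"mu": float, "sigma": float},
                            {"pp": float, "SV": float})
image_type = PatternType("Image", {"specimens": specimen_type}, {})
```

Here `specimen_type` is simple, while the hypothetical `image_type` is complex because its structure schema refers to another pattern type.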
B. Low-Level Image Feature Extraction
Feature extraction is the process of determining the relevant content of the images. Color, shape, and texture [12] are the features most commonly used in CBIR. Texture plays an important role in medical image interpretation. Image texture is a function of the spatial variation in pixel intensities (gray values) within a spatial neighborhood; a texture region is a connected set of pixels satisfying a gray-level property. Texture analysis is used in applications such as remote sensing, automated inspection, and medical image processing. The co-occurrence matrix and the run length matrix are used as texture analysis tools. In this paper, entropy, contrast, energy, correlation, homogeneity and run length matrix texture features are used.
The entropy, contrast, energy, correlation, and homogeneity features are calculated from the gray level co-occurrence matrix. The run length features are computed from the run length matrix.

Figure 2: Co-Occurrence Matrix Directions
The spatial gray level co-occurrence matrix (GLCM) [13] estimates properties related to second-order statistics of the image. It records how often different combinations of pixel intensity values (gray levels) occur in an image, and so captures the spatial dependence of gray-level values within the image. It is also known as the gray-level spatial dependence matrix. The co-occurrence matrix is often formed using a set of offsets at 0, 45, 90, and 135 degrees. An offset, often expressed as an angle, specifies the displacement between the pixel of interest and its neighbor.
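As an illustration of how a co-occurrence matrix can be built for one of the four offsets, the sketch below counts gray-level pairs with NumPy. It assumes the image is already quantized to integer gray levels in the range 0 to `levels - 1`; the function names and the non-symmetric counting convention are choices made here, not details given in the paper.

```python
import numpy as np

# (row step, column step) per direction at distance 1; row indices grow downward.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, angle=0, distance=1):
    """Count how often gray level i has gray level j as its neighbor at the offset."""
    image = np.asarray(image)
    dr, dc = (distance * step for step in OFFSETS[angle])
    rows, cols = image.shape
    counts = np.zeros((levels, levels), dtype=np.int64)
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            # Skip neighbors that fall outside the image.
            if 0 <= r2 < rows and 0 <= c2 < cols:
                counts[image[r, c], image[r2, c2]] += 1
    return counts


def normalized_glcm(image, levels, angle=0, distance=1):
    """Scale the counts so that the entries sum to 1."""
    counts = glcm(image, levels, angle, distance).astype(float)
    total = counts.sum()
    return counts / total if total > 0 else counts
```

For example, `glcm([[0, 0, 1, 1], [0, 0, 1, 1], [0, 2, 2, 2], [2, 2, 3, 3]], levels=4, angle=0)` counts the twelve horizontal neighbor pairs of a 4 x 4 image with four gray levels.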
Contrast is related to the dynamic range of gray levels in an image. It measures the intensity contrast between the pixel of interest and its neighbor. Here $C_d(i, j)$ is the entry of the co-occurrence matrix for the gray-level pair $(i, j)$.

$$\mathrm{Contrast} = \sum_{i} \sum_{j} (i - j)^2 \, C_d(i, j) \qquad (1)$$
Energy is also known as angular second moment which
measures the sum of squared elements in the GLCM.

$$\mathrm{Energy} = \sum_{i} \sum_{j} C_d(i, j)^2 \qquad (2)$$
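Correlation estimates how correlated a pixel is to its neighbor over the whole image.

$$\mathrm{Correlation} = \sum_{i} \sum_{j} \frac{(i - \mu_i)(j - \mu_j) \, C_d(i, j)}{\sigma_i \sigma_j} \qquad (3)$$

where $\mu_i$, $\mu_j$ and $\sigma_i$, $\sigma_j$ are the means and standard deviations of the row and column marginal distributions of $C_d$.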
Homogeneity measures the closeness of the distribution of
elements in the GLCM to the GLCM diagonal.
$$\mathrm{Homogeneity} = \sum_{i} \sum_{j} \frac{C_d(i, j)}{1 + |i - j|} \qquad (4)$$
Entropy is the inverse measure of homogeneity.
$$\mathrm{Entropy} = -\sum_{i} \sum_{j} C_d(i, j) \log C_d(i, j) \qquad (5)$$
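The five co-occurrence features of Equations (1) to (5) can be computed directly from a co-occurrence matrix. The sketch below is an illustration under stated assumptions: the matrix is normalized to sum to 1 before use, the natural logarithm is used for entropy (the paper does not state the base), zero entries are skipped in the entropy sum, and correlation follows the standard mean and standard deviation form. The function name `glcm_features` is introduced here.

```python
import numpy as np


def glcm_features(C):
    """Contrast, energy, correlation, homogeneity and entropy of a GLCM."""
    C = np.asarray(C, dtype=float)
    C = C / C.sum()                      # assumption: work with probabilities
    i, j = np.indices(C.shape)           # gray-level index grids

    contrast = np.sum((i - j) ** 2 * C)
    energy = np.sum(C ** 2)
    homogeneity = np.sum(C / (1.0 + np.abs(i - j)))

    nonzero = C[C > 0]                   # skip zero entries, since log(0) is undefined
    entropy = -np.sum(nonzero * np.log(nonzero))

    mu_i, mu_j = np.sum(i * C), np.sum(j * C)
    sigma_i = np.sqrt(np.sum((i - mu_i) ** 2 * C))
    sigma_j = np.sqrt(np.sum((j - mu_j) ** 2 * C))
    if sigma_i > 0 and sigma_j > 0:
        correlation = np.sum((i - mu_i) * (j - mu_j) * C) / (sigma_i * sigma_j)
    else:
        correlation = float("nan")       # undefined for a constant image

    return {
        "contrast": float(contrast),
        "energy": float(energy),
        "correlation": float(correlation),
        "homogeneity": float(homogeneity),
        "entropy": float(entropy),
    }
```

A matrix with all of its mass on the diagonal gives zero contrast and a homogeneity of 1, which matches the description of homogeneity as closeness to the GLCM diagonal.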
In the run length matrix, each element p(i, j) is the number of runs of pixels with gray-level intensity i and run length j along a specific orientation. For a given image, a gray-level run is a set of consecutive, collinear pixels having the same gray level. The length of the run is the number of pixels in the run [14]. Short Run Emphasis (SRE) measures the distribution of short runs.

$$\mathrm{SRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j)}{j^2} \qquad (6)$$
Long Run Emphasis (LRE) measures the distribution of
long runs.

$$\mathrm{LRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j) \, j^2 \qquad (7)$$
Gray-Level Nonuniformity (GLN) measures the similarity
of gray level values throughout the image.

$$\mathrm{GLN} = \frac{1}{n_r} \sum_{i=1}^{M} \left( \sum_{j=1}^{N} p(i, j) \right)^2 \qquad (8)$$
Run Length Nonuniformity (RLN) measures the similarity
of length of runs throughout the image.

$$\mathrm{RLN} = \frac{1}{n_r} \sum_{j=1}^{N} \left( \sum_{i=1}^{M} p(i, j) \right)^2 \qquad (9)$$
Low Gray-Level Run Emphasis (LGRE) measures the
distribution of low gray level values.

$$\mathrm{LGRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j)}{i^2} \qquad (10)$$
High Gray-Level Run Emphasis (HGRE) measures the
distribution of high gray level values.

$$\mathrm{HGRE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j) \, i^2 \qquad (11)$$
Short Run Low Gray-Level Emphasis (SRLGE) measures
the joint distribution of short runs and low gray level values.

$$\mathrm{SRLGE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j)}{i^2 \, j^2} \qquad (12)$$
Short Run High Gray-Level Emphasis (SRHGE) measures
the joint distribution of short runs and high gray level values.
$$\mathrm{SRHGE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j) \, i^2}{j^2} \qquad (13)$$
Long Run Low Gray-Level Emphasis (LRLGE) measures
the joint distribution of long runs and low gray level values.
$$\mathrm{LRLGE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} \frac{p(i, j) \, j^2}{i^2} \qquad (14)$$
Long Run High Gray-Level Emphasis (LRHGE) measures
the joint distribution of long runs and high gray level values.

$$\mathrm{LRHGE} = \frac{1}{n_r} \sum_{i=1}^{M} \sum_{j=1}^{N} p(i, j) \, i^2 \, j^2 \qquad (15)$$
Here M is the number of gray levels, N is the maximum run length, $n_r$ is the total number of runs, and $n_p$ is the number of pixels in the image.
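The run length features of Equations (6) to (15) can be sketched as follows. This is an illustration only: it builds the run length matrix for the horizontal (0 degree) orientation, assumes integer gray levels 0 to `levels - 1`, and maps gray level 0 to index i = 1 so that the sums start at 1 as in the equations. The function names are introduced here.

```python
import numpy as np


def run_length_matrix(image, levels):
    """Horizontal run length matrix: p[g, L - 1] = number of runs of gray level g, length L."""
    image = np.asarray(image)
    rows, cols = image.shape
    p = np.zeros((levels, cols), dtype=np.int64)
    for row in image:
        start = 0
        for c in range(1, cols + 1):
            # A run ends at the row boundary or where the gray level changes.
            if c == cols or row[c] != row[start]:
                p[row[start], c - start - 1] += 1
                start = c
    return p


def run_length_features(p):
    """Run length features (6)-(15) from a run length matrix p."""
    p = np.asarray(p, dtype=float)
    n_r = p.sum()                                              # total number of runs
    i = np.arange(1, p.shape[0] + 1, dtype=float)[:, None]     # gray-level index 1..M
    j = np.arange(1, p.shape[1] + 1, dtype=float)[None, :]     # run length 1..N
    return {
        "SRE": float(np.sum(p / j**2) / n_r),
        "LRE": float(np.sum(p * j**2) / n_r),
        "GLN": float(np.sum(p.sum(axis=1) ** 2) / n_r),
        "RLN": float(np.sum(p.sum(axis=0) ** 2) / n_r),
        "LGRE": float(np.sum(p / i**2) / n_r),
        "HGRE": float(np.sum(p * i**2) / n_r),
        "SRLGE": float(np.sum(p / (i**2 * j**2)) / n_r),
        "SRHGE": float(np.sum(p * i**2 / j**2) / n_r),
        "LRLGE": float(np.sum(p * j**2 / i**2) / n_r),
        "LRHGE": float(np.sum(p * i**2 * j**2) / n_r),
    }
```

For the image `[[0, 0, 1], [1, 1, 1]]` with two gray levels there are three runs: one run of gray level 0 with length 2, and runs of gray level 1 with lengths 1 and 3.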
C. KNN Classifier
K-nearest neighbor (kNN) classification [15] finds the group of k training tuples (the k nearest neighbors) in the training set that are closest to the unknown tuple. To classify an unlabeled tuple, its distance to each labeled tuple is computed to identify the k nearest neighbors, and the most common class label among these neighbors is assigned to the unknown tuple. The k-nearest neighbor algorithm (k-NN) is a lazy learning method. Unknown tuples are classified by their closeness to known tuples according to some distance/similarity function. Here, Euclidean distance is used as the distance metric. The Euclidean distance between two tuples is given by:

$$dis(X_1, X_2) = \sqrt{\sum_{i=1}^{n} (x_{1i} - x_{2i})^2} \qquad (16)$$
Once the nearest-neighbor list is obtained, the test tuple is
classified based on the majority class of its nearest neighbors.
If k = 1, then the unknown tuple is simply assigned the class of
its nearest neighbor.

D. KNN Algorithm
Input: the set of training tuples and an unlabeled test tuple.
Process:
Compute the distance between the unlabeled test tuple and each training tuple.
Select the k training tuples closest to the unlabeled tuple (its k nearest neighbors).
Output: label the test tuple with the majority class of its nearest neighbors.
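The algorithm above can be sketched in a few lines of Python using the Euclidean distance of Equation (16) and a majority vote. The function names, the `(feature_vector, label)` layout of the training set, and the default `k=3` are assumptions made for this illustration.

```python
import math
from collections import Counter


def euclidean(x1, x2):
    """Euclidean distance between two feature tuples, as in Equation (16)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def knn_classify(training, query, k=3):
    """Label `query` with the majority class of its k nearest training tuples.

    `training` is a list of (feature_vector, label) pairs.
    """
    # Step 1: sort the training tuples by distance to the query tuple.
    by_distance = sorted(training, key=lambda item: euclidean(item[0], query))
    # Step 2: keep the k closest tuples.
    neighbors = by_distance[:k]
    # Output: majority class; on a tie the label seen first (nearest) wins.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]
```

With `k=1` the function simply returns the label of the single nearest training tuple.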
E. Pattern Instantiation
After classification, each group is treated as a pattern. Specimen_i is instantiated for each pattern P_i representing a physical anatomic specimen in a medical image:

$$\mathrm{Specimen}_i = \big( SS: (D: [\mu: [\mathrm{Real}],\ \sigma: [\mathrm{Real}]]_{1}^{N}),\ \ MS: (pp: [\mathrm{Real}],\ SV: [\mathrm{Real}]) \big) \qquad (17)$$

where the structure schema SS is represented by the pair (μ, σ), the mean and standard deviation of each distribution D. The measure schema MS is represented by two values, the prior probability (pp) and the scatter value (SV) of P_i. The prior probability pp is defined as the fraction of the feature vectors of the image that belong to pattern P_i. SV is a measure of the cohesiveness of the data items in a group with respect to the centroid of that group. A low SV indicates good scatter quality.
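A pattern can be instantiated from the classified feature vectors roughly as follows. This is a sketch under assumptions: the paper does not give a formula for SV, so the mean squared distance of the group members to their centroid is used here as one possible cohesiveness measure, and the per-dimension mean and standard deviation stand in for the structure component. The function name and dictionary layout are illustrative.

```python
import numpy as np


def instantiate_patterns(features, labels):
    """Build one pattern per class label from classified feature vectors."""
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    patterns = {}
    for label in np.unique(labels):
        group = features[labels == label]
        centroid = group.mean(axis=0)
        patterns[label.item()] = {
            # Structure component: per-dimension mean and standard deviation.
            "mu": centroid,
            "sigma": group.std(axis=0),
            # Measure component: prior probability and scatter value.
            "pp": len(group) / len(features),
            # Assumption: SV as the mean squared distance to the centroid.
            "SV": float(np.mean(np.sum((group - centroid) ** 2, axis=1))),
        }
    return patterns
```

A tight group yields a small SV, in line with the statement that a low SV indicates good scatter quality.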
F. Pattern Similarity
The pattern similarity of two simple patterns P1 and P2 is computed from the distances between their structure components and between their measure components. When comparing two medical images, MI1 and MI2, the component patterns of MI1 must be associated with the component patterns of MI2.
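The distance between the measure components, using the scatter value and the prior probability, is:

$$dis_{meas}(P_1, P_2) = \frac{|P_1.pp \cdot P_1.SV - P_2.pp \cdot P_2.SV|}{P_1.SV + P_2.SV} \qquad (18)$$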
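To find the structural similarity between P1 and P2, first compute the standardized difference d between two distributions using Cohen's distance metric:

$$d(D_1, D_2) = \begin{cases} \dfrac{|D_1.\mu - D_2.\mu|}{\sqrt{(D_1.\sigma^2 + D_2.\sigma^2)/2}}, & \text{if } D_1.\sigma \neq 0 \text{ or } D_2.\sigma \neq 0 \\ |D_1.\mu - D_2.\mu|, & \text{otherwise} \end{cases} \qquad (19)$$

If d = 0, the distributions are identical. A low value of d indicates quite similar distributions, and a high value indicates quite dissimilar distributions.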
The structural distance between two sets of distributions is the result of an aggregate function:

$$dis_{struct}(P_1, P_2) = g_{aggr}\big( d(D_j^{1}, D_j^{2}) \big), \quad j = 1, 2, \dots, N \qquad (20)$$

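The distance between two patterns, dis(P1, P2), is:

$$dis(P_1, P_2) = dis_{struct}(P_1, P_2) + \big( 1 - dis_{struct}(P_1, P_2) \big) \cdot dis_{meas}(P_1, P_2) \qquad (21)$$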
To compare two medical images MI1 and MI2, a coupling between the patterns of each image is adopted:

$$dis(MI_1, MI_2) = \frac{1}{M \cdot K} \sum_{i=1}^{M} \sum_{j=1}^{K} dis\big( P_i^{MI_1}, P_j^{MI_2} \big) \qquad (22)$$

where M and K are the numbers of constituent simple patterns of each image. The final outcome is the average over all possible matchings.
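The distance computations of this subsection can be put together in a short sketch. Several points are assumptions made here and not statements from the paper: the measure distance is read as the absolute difference of pp·SV normalized by the sum of the scatter values; the aggregate function g_aggr is unspecified, so the mean of d / (1 + d) over the feature dimensions is used to keep the structural distance within [0, 1); and two patterns with zero total scatter are given a measure distance of 0. All function names and the dictionary layout of a pattern are illustrative.

```python
import math


def cohen_d(mu1, sigma1, mu2, sigma2):
    """Standardized difference between two distributions (Cohen's d)."""
    diff = abs(mu1 - mu2)
    if sigma1 == 0 and sigma2 == 0:
        return diff
    return diff / math.sqrt((sigma1 ** 2 + sigma2 ** 2) / 2.0)


def dis_struct(p1, p2):
    """Structural distance; the aggregate (mean of d / (1 + d)) is an assumption."""
    ds = [cohen_d(m1, s1, m2, s2)
          for m1, s1, m2, s2 in zip(p1["mu"], p1["sigma"], p2["mu"], p2["sigma"])]
    return sum(d / (1.0 + d) for d in ds) / len(ds)


def dis_meas(p1, p2):
    """Measure distance from prior probability pp and scatter value SV (assumed form)."""
    total = p1["SV"] + p2["SV"]
    if total == 0:
        return 0.0  # assumption: no scatter in either pattern
    return abs(p1["pp"] * p1["SV"] - p2["pp"] * p2["SV"]) / total


def dis_pattern(p1, p2):
    """Combine the structural and measure distances of two patterns."""
    s = dis_struct(p1, p2)
    return s + (1.0 - s) * dis_meas(p1, p2)


def dis_image(patterns1, patterns2):
    """Average pattern distance over all pairings of two images' patterns."""
    total = sum(dis_pattern(a, b) for a in patterns1 for b in patterns2)
    return total / (len(patterns1) * len(patterns2))


def retrieve(query_patterns, database, top=3):
    """Return the names of the `top` database images closest to the query."""
    ranked = sorted(database.items(), key=lambda kv: dis_image(query_patterns, kv[1]))
    return [name for name, _ in ranked[:top]]
```

Here `database` is a hypothetical mapping from an image name to its list of patterns; retrieval amounts to ranking the stored images by their distance to the query image.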
IV. RESULTS
The experiments were performed on mammogram images taken at random from patients of different ages with pathologic conditions. A mammogram shows the internal structure of the breast. Texture features are commonly used to differentiate normal and abnormal tissues. Abnormal breast tissue (tumor) had higher contrast than normal tissue.

Figure 3: (a) Query Image (b) After Classification

Figure 4: Detected Tumor in Stage I

Figure 5: Retrieved Images with Tumor in Stage I
Figure 4 shows the correct stage of the breast cancer. Figure 5 shows past cases that are similar to the query image. The classification rate of the k nearest neighbor classifier is around 85%. Doctors can use the retrieved past cases to diagnose the correct stage of cancer so that treatment can begin at the appropriate time.
V. CONCLUSION
We presented a CBIR approach that stores a number of past cases with pathologic conditions and retrieves images from them to identify the real stage of breast cancer. The mammogram of a woman with higher risk factors for breast cancer is compared with the past cases for early detection. The kNN classifier classifies the query mammogram as normal or abnormal and detects the tumor. The retrieval efficiency of this similarity scheme depends on the classification rate of the kNN classifier. The pattern similarity scheme can be applied to large image retrieval tasks to predict the real stage of breast cancer. The primary goal of early detection is to determine the correct stage of cancer in time and reduce the death rate.
REFERENCES
[1] E.C. Fear, P.M. Meaney, M.A. Stuchly, Microwaves for breast cancer
detection, IEEE potentials, Vol. 22, Pp.12-18, 2003.
[2] K. Hunt, Breast cancer risk and hormone replacement therapy: a review
of the epidemiology, International journal of fertility and menopausal
studies, Vol. 39, No.2, Pp. 67-74, 1994.
[3] Y. Ireaneus Anna Rejani, Dr.S. Thamarai Selvi, Early detection of
breast cancer using svm classifier technique, International Journal on
Computer Science and Engineering, Vol. 1, No. 3, Pp.127-130, 2009.
[4] J.Z. Wang, J. Li, G. Wiederhold, SIMPLIcity: Semantics-sensitive
integrated matching for picture libraries, IEEE Trans. Pattern
Anal. Mach. Intell., Vol. 23, No. 9, Pp. 947-963, 2001.
[5] T. Deselaers, D. Keysers, H. Ney, FIRE - flexible image retrieval
engine: ImageCLEF 2004 evaluation, in Lecture Notes in Computer
Science, Vol. 3491, Pp. 688-698, 2004.
[6] V.S.V.S. Murthy, E. Vamsidhar, P. Sankara Rao, Content Based Image
Retrieval using Hierarchical and K-Means Clustering Techniques,
International Journal of Engineering Science and Technology, Vol. 2,
No. 3, Pp.209-212, 2010.
[7] http://www.emro.who.int/publications/emhj/0503/01.htm
[8] H. Greenspan, A.T. Pinhas, Medical image categorization and retrieval
for PACS using the GMM-KL framework, IEEE Trans. Inf. Technol.
Biomed., Vol. 11, No. 2, Pp. 190-202, 2007.
[9] J. Dheeba, G.Wiselin Jiji, Detection of Microcalcification Clusters in
Mammograms using Neural Network, International Journal of
Advanced Science and Technology, Vol. 19, Pp. 13-22, 2010.
[10] I. El-Naqa, Y. Yang, N.P. Galatsanos, R.M. Nishikawa, M.N. Wernick,
A similarity learning approach to content-based image retrieval:
Application to digital mammography, IEEE Trans. Med. Imag., Vol.
23, No. 10, Pp. 1233-1244, 2004.
[11] I. Bartolini, P. Ciaccia, I. Ntoutsi, M. Patella, Y. Theodoridis, A unified
and flexible framework for comparing simple and complex patterns, in
Proc. 8th Eur. Conf. Principles Pract. Knowl. Discov. Database (PKDD),
Pp. 496-499, 2004.
[12] P.S. Hiremath, Jagadeesh Pujari, Content Based Image Retrieval based
on Color, Texture and Shape features using Image and its complement,
International Journal of Computer Science and Security, Vol.1, Pp. 25-
35.
[13] Dan Popescu, Radu Dobrescu, Maximilian Nicolae, Texture
Classification and Defect Detection by Statistical Features,
International Journal of Circuits, Systems and Signal Processing, Vol.1,
Pp. 79-84, 2007.
[14] Xiaoou Tang, Texture Information in Run-Length Matrices, IEEE
Transactions on Image Processing, Vol. 7, No. 11, Pp. 1602-1609,
1998.
[15] P-N. Tan, M. Steinbach, V. Kumar, Introduction to data mining,
Pearson Addison-Wesley, 2006.
[16] http://rokocancer.org/aboutcancer
[17] http://www.paragise.com/breastcancer.pdf

Jini.R. Marsilin is presently doing M.E (CSE) in Dr.
Sivanthi Aditanar College of Engineering, Tiruchendur.
Has completed B.Tech (IT) in JJ College of
Engineering and Technology, Trichy. Her field of
interest includes image processing and Data Mining.

Dr.G. Wiselin Jiji is presently working as Professor of
Computer Science and Engineering at Dr. Sivanthi
Aditanar College of Engineering, Tiruchendur and has
15 Years of experience in the field of Computer
science. Has published 35 Papers in the fields of
medical image processing and management studies.
Has completed two R & D projects sponsored by
AICTE and DRDO.
