
J. Vis. Commun. Image R. 17 (2006) 701–716
www.elsevier.com/locate/jvci

Evaluation and comparison of texture descriptors proposed in MPEG-7
Feng Xu *, Yu-Jin Zhang
Department of Electronic Engineering, Tsinghua University, Beijing 100084, PR China

Received 1 March 2004; accepted 20 October 2005
Available online 29 November 2005

Abstract

Texture is one of the most important low-level features in content-based image retrieval. In MPEG-7, the
homogeneous texture descriptor (HTD), the texture browsing descriptor (TBD), and the edge histogram descriptor
(EHD) have been proposed as texture descriptors. However, no comprehensive evaluation and comparison of these
three descriptors has been made. In this paper, we propose a comprehensive evaluation and comparison benchmark
for feature descriptors, especially for the visual descriptors in MPEG-7. Within the proposed benchmark, the
three MPEG-7 texture descriptors are evaluated and compared. First, the descriptors are analyzed according to
standard criteria. Second, experiments are carried out on the Brodatz texture image database. Analysis of the
experimental results shows that each descriptor has specific characteristics and performs better than the
other two in certain applications. The applicability of each descriptor is also summarized. The survey,
together with the performance evaluation and comparison in this paper, provides several guidelines for using
these descriptors in image retrieval and other applications.
© 2005 Elsevier Inc. All rights reserved.

Keywords: Content-based image retrieval; MPEG-7; Homogeneous texture descriptor; Texture browsing descriptor; Edge histogram
descriptor

1. Introduction

MPEG-7, an international standard whose official name is "Multimedia Content Description Interface,"
aims at providing fundamental tools for describing multimedia content. MPEG-7 defines the syntax and
semantics for describing multimedia content and consists of seven parts: system, description definition
language (DDL), visual descriptor, audio descriptor, multimedia description scheme (MDS), reference software,
and conformance testing. In the visual descriptor part, the visual descriptors are specified as normative
descriptors, basic descriptors, and descriptors for localization [1]. Some core visual descriptors are
defined to describe the color, texture, shape, and motion features of visual data.

* Corresponding author. Fax: +86 10 62770317.
E-mail address: f-xu02@mails.tsinghua.edu.cn (F. Xu).

1047-3203/$ - see front matter © 2005 Elsevier Inc. All rights reserved.
doi:10.1016/j.jvcir.2005.10.002

In MPEG-7, three texture descriptors are defined: the homogeneous texture descriptor (HTD), the texture
browsing descriptor (TBD), and the edge histogram descriptor (EHD) [2,3]. Abundant research on texture
description has been performed alongside the development of MPEG-7 [1,4–9].
Texture is one of the most important low-level features in content-based image retrieval (CBIR). As Zhang
[10] summarized, there are several models for texture representation, including texture models based on
spatial relationships (such as auto-correlation, gray-level co-occurrence matrix, fractal, and stochastic
field), texture models based on frequency relationships (such as power spectrum and wavelet), and texture
models based on perceptual structure. An effective texture descriptor can significantly improve the
performance of image retrieval. Therefore, evaluation and comparison of different texture descriptors are
necessary.
Although MPEG-7 has documented many characteristics of the three texture descriptors, and a significant
body of previous work has proposed algorithms to improve their performance, no comprehensive evaluation and
comparison has been performed, to the best of our knowledge. With the wide application of MPEG-7, an
analytical and comprehensive evaluation and comparison of its descriptors is needed so that the most
appropriate descriptor can be used effectively and efficiently in a given application. In this paper, the
three texture descriptors in MPEG-7 are evaluated and compared in an overall and hierarchical framework.
First, some common evaluation criteria, including good retrieval accuracy, compact features, general
applicability, low computational complexity, robust retrieval performance, and hierarchical coarse-to-fine
representation, are used in the analytical study as in [11]. Second, evaluation and comparison experiments
are carried out. Since texture-based image retrieval is one of the most important applications of texture
descriptors, we use image retrieval experiments for the evaluation and comparison. Image retrieval is
performed on a texture image database containing a significant number of images so that the performance of
each descriptor can be examined adequately. The widely accepted Brodatz texture database [12] is used, and
several negative influencing factors are considered. First, the robustness of the retrieval performance to
noise, rotation, and compression is considered. Second, the efficiency of the descriptors is compared in
terms of computing time. Finally, the retrieval performance with different distance measures is
investigated. All results are reported with precision and recall indicators to compare descriptor
performance [13]. Based on this evaluation and comparison of effectiveness and efficiency, we summarize the
applicability of each descriptor.
As a survey and performance testing paper, this paper makes contributions in three aspects. First, a
comprehensive framework for the evaluation and comparison of feature descriptors is proposed, in which other
families of descriptors, especially visual descriptors in MPEG-7 such as color descriptors and shape
descriptors, can also be evaluated and compared. It provides a standard benchmark for descriptor evaluation
and comparison. Second, the three texture descriptors defined in MPEG-7 are evaluated and compared, which
provides a basis for selecting among them in applications. Third, the applicability of the three texture
descriptors is summarized.
The rest of the paper is organized as follows. In Section 2, the three descriptors are briefly introduced
and explained. In Section 3, an analytical discussion and evaluation of the three descriptors is provided.
In Section 4, experimental comparisons based on image retrieval are given in detail. In Section 5, the
applicability of each descriptor is summarized. Conclusions are presented in Section 6.

2. Texture descriptors

In this section, the principles of the three texture descriptors HTD, TBD, and EHD are described and
discussed.

2.1. Homogeneous texture descriptor

The homogeneous texture descriptor characterizes the region texture using the energy and energy deviation
in a set of frequency channels [2]. In MPEG-7, it is designed for texture-based image search and retrieval.
HTD is extracted by Gabor filter banks, which partition the frequency space in equal steps of 30° in the
angular direction and in octave divisions in the radial direction. According to previous results [1,6], the
best numbers of angular and radial divisions are 6 and 5, respectively, resulting in 30 channels in total.
The channels are illustrated in Fig. 1.

Fig. 1. Frequency space partition of HTD (each cell of the partition is one channel C_i).

In each channel, the following 2-D Gabor function is applied to filter the image:
$$G_{P_{s,r}}(\omega,\theta) = \exp\left[\frac{-(\omega-\omega_s)^2}{2\sigma_{\rho_s}^2}\right] \cdot \exp\left[\frac{-(\theta-\theta_r)^2}{2\sigma_{\theta_r}^2}\right], \qquad (1)$$

where $\{\omega_s = \omega_0 \cdot 2^{-s},\ s = 0, 1, 2, 3, 4\}$ are the center frequencies in the radial
direction, and $\omega_0$ is the center frequency of the highest frequency channel, specified as 3/4. The
corresponding bandwidths are $\{B_s = B_0 \cdot 2^{-s},\ s = 0, 1, 2, 3, 4\}$, and $B_0$ is the largest
bandwidth, specified as 1/2. $\{\theta_r = 30^\circ \times r,\ r = 0, 1, 2, 3, 4, 5\}$ are the center angles
in the angular direction. In addition, $\sigma_{\rho_s} = B_s / (2\sqrt{2 \ln 2})$, and $\sigma_{\theta_r}$
is the constant $30^\circ / (2\sqrt{2 \ln 2})$.
After filtering, the first and second moments in 30 frequency channels are computed, together with the
intensity mean and standard deviation of the original image, to compose the HTD represented as a 62-dimen-
sional vector. Then the descriptor is quantized into 8 bits for each number according to the standardized
tables. A detailed introduction to HTD is given in [1].
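The channel layout and Eq. (1) can be checked numerically with a minimal sketch such as the one below. It only evaluates the filter in the polar frequency domain from the parameters quoted above; the function names (`channel_params`, `gabor_response`) are illustrative and not part of MPEG-7, and angular wrap-around and the actual image filtering are deliberately left out.

```python
import math

OMEGA_0 = 3 / 4       # centre frequency of the highest-frequency channel (from the text)
B_0 = 1 / 2           # largest radial bandwidth (from the text)
ANGLE_STEP = 30.0     # angular spacing in degrees (from the text)
K = 2 * math.sqrt(2 * math.log(2))   # the factor 2*sqrt(2 ln 2) used for both sigmas


def channel_params(s, r):
    """Return (omega_s, theta_r, sigma_rho, sigma_theta) for radial index s, angular index r."""
    omega_s = OMEGA_0 * 2 ** (-s)
    b_s = B_0 * 2 ** (-s)
    theta_r = ANGLE_STEP * r
    return omega_s, theta_r, b_s / K, ANGLE_STEP / K


def gabor_response(omega, theta, s, r):
    """Evaluate Eq. (1) at polar frequency (omega, theta); theta is in degrees.

    Assumption of this sketch: no wrap-around of the angle is applied.
    """
    omega_s, theta_r, sigma_rho, sigma_theta = channel_params(s, r)
    radial = math.exp(-((omega - omega_s) ** 2) / (2 * sigma_rho ** 2))
    angular = math.exp(-((theta - theta_r) ** 2) / (2 * sigma_theta ** 2))
    return radial * angular


# 5 radial x 6 angular divisions give the 30 channels; two moments per channel
# plus the image mean and standard deviation give the 62 components.
channels = [(s, r) for s in range(5) for r in range(6)]
htd_length = 2 * len(channels) + 2
```

With these definitions the response equals 1 at a channel center and falls to one half at a radial offset of B_s/2 or an angular offset of 15°, which is what the choice of the constant 2√(2 ln 2) implies.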
When HTD is used for texture-based image retrieval and indexing, the general distance measure is the city
block distance (absolute value distance):
$$d_{ij} = \sum \left( \left| e_i - e_j \right| + \left| d_i - d_j \right| \right), \qquad (2)$$
where $d_{ij}$ denotes the distance between two images i and j; $e_i$ and $e_j$ denote the corresponding
energy means of the descriptor vectors; and $d_i$ and $d_j$ denote the corresponding energy deviations. The
sum runs over the components of the descriptor vectors.
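As a small illustration of Eq. (2), the sketch below computes the city block distance from two lists of energy means and two lists of energy deviations. The function name and the list-based representation are assumptions made for this example only.

```python
def htd_distance(e_i, d_i, e_j, d_j):
    """City block distance of Eq. (2) between two HTD feature sets.

    e_* are the energy means and d_* the energy deviations, one value per channel.
    """
    if not (len(e_i) == len(e_j) == len(d_i) == len(d_j)):
        raise ValueError("all feature lists must have the same length")
    mean_term = sum(abs(a - b) for a, b in zip(e_i, e_j))
    deviation_term = sum(abs(a - b) for a, b in zip(d_i, d_j))
    return mean_term + deviation_term
```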

2.2. Texture browsing descriptor

TBD relates to a perceptual characterization of texture, similar to human visual characterization in terms
of regularity, coarseness, and directionality. This descriptor is useful for browsing applications and
coarse classification of texture [2]. Like HTD, it is computed from Gabor filter banks. The dominant
orientations and scales are determined by projection, and then the regularity is computed. Finally, the TBD
vector, also called the Perceptual Browsing Component (PBC) in [8], is assembled.
The TBD is a 5-dimensional vector expressed as:
$$\mathrm{PBC} = [\text{Regularity}\;\; \text{Directionality1}\;\; \text{Directionality2}\;\; \text{Scale1}\;\; \text{Scale2}], \qquad (3)$$
where Regularity represents the degree of periodic structure of the texture. The larger the Regularity
value is, the more regular the pattern is. The Directionality components represent the two dominant
orientations of the texture, while the Scale components represent its two dominant scales. In Eq. (3),
Directionality1 denotes the primary dominant orientation and Directionality2 denotes the secondary dominant
orientation. Analogously, Scale1 denotes the primary dominant scale and Scale2 denotes the secondary
dominant scale. In MPEG-7, the regularity is cast into four degrees: irregular, slightly regular, regular,
and highly regular. The TBD vector can also be quantized according to the recommended tables. The detailed
computing process for TBD can be found in [8].
When TBD is used in texture-based image retrieval and browsing, the Euclidean distance measure shown below
can be applied as the similarity measure:
$$d_{ij} = \sqrt{\sum_{k=1}^{5} \left( \mathrm{PBC}_k(i) - \mathrm{PBC}_k(j) \right)^2}, \qquad (4)$$

where $d_{ij}$ denotes the distance between two images i and j, and $\mathrm{PBC}_k(i)$ denotes the k-th
component of the PBC vector of image i.
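Eq. (4) amounts to a plain Euclidean distance over the five PBC components. A minimal sketch follows; the function name and the use of Python lists are assumptions of this example.

```python
import math


def tbd_distance(pbc_i, pbc_j):
    """Euclidean distance of Eq. (4) between two 5-component PBC vectors.

    Component order follows Eq. (3):
    [Regularity, Directionality1, Directionality2, Scale1, Scale2].
    """
    if len(pbc_i) != 5 or len(pbc_j) != 5:
        raise ValueError("a PBC vector has exactly 5 components")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pbc_i, pbc_j)))
```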

2.3. Edge histogram descriptor

The edge histogram descriptor represents the spatial distribution of five types of edge in local image
regions, defined as four directional edges and one non-directional edge. The four directional edges are
obtained by counting edges in the 0°, 45°, 90°, and 135° directions, respectively. In the implementation of
the descriptor, an image is divided into 4 × 4 non-overlapping sub-images. Each sub-image is further divided
into image-blocks (the number of blocks depends on the specific application). The five types of edge
information can be extracted from the image-blocks by edge detection operators. Thus, for each sub-image a
local edge histogram with 5 bins is generated, giving a total of 80 histogram bins (16 sub-images times 5
bins) for the whole image. The division into sub-images and image-blocks is illustrated in Fig. 2.
Here, an image-block is denoted as B and the edge filter (2 × 2 matrix) coefficients are denoted as f(k),
k = 0, 1, 2, 3. The magnitude m of each edge can be calculated as follows:
$$m = \left| \sum_{k=0}^{3} B \cdot f(k) \right|. \qquad (5)$$

If the maximum value among the five types of edge strength is greater than a pre-determined threshold, the
image-block is considered to contain the corresponding edge. Otherwise, the image-block contains no edge.
After all edge values of the same type are summed within one sub-image, the five bins for the different
edge types are obtained for that sub-image. The values of the edge bins are normalized by the total number
of blocks. Finally, the whole histogram can be quantized according to the recommended tables. A more
detailed description of the computation can be found in [9].
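The block classification and the 5-bin local histogram described above can be sketched as follows. The filter coefficients are not given in this paper; the values below are the ones commonly quoted for the MPEG-7 edge histogram in the literature and should be treated as an assumption here, as should the reading of Eq. (5) in which each coefficient multiplies the mean grey level of one quadrant of the block. The threshold value passed in is likewise up to the application.

```python
import math

SQRT2 = math.sqrt(2)

# ASSUMED filter coefficients (not stated in this paper), applied to the mean grey
# levels of the four quadrants of a block in the order
# (top-left, top-right, bottom-left, bottom-right).
EDGE_FILTERS = {
    "vertical": (1, -1, 1, -1),
    "horizontal": (1, 1, -1, -1),
    "diag_45": (SQRT2, 0, 0, -SQRT2),
    "diag_135": (0, SQRT2, -SQRT2, 0),
    "non_directional": (2, -2, -2, 2),
}


def classify_block(quadrant_means, threshold):
    """Return the dominant edge type of one image-block, or None if it has no edge."""
    strengths = {
        name: abs(sum(a * f for a, f in zip(quadrant_means, coeffs)))
        for name, coeffs in EDGE_FILTERS.items()
    }
    best = max(strengths, key=strengths.get)
    return best if strengths[best] > threshold else None


def local_histogram(blocks, threshold):
    """5-bin edge histogram of one sub-image, normalized by its number of blocks."""
    if not blocks:
        raise ValueError("a sub-image needs at least one image-block")
    counts = dict.fromkeys(EDGE_FILTERS, 0)
    for quadrant_means in blocks:
        label = classify_block(quadrant_means, threshold)
        if label is not None:
            counts[label] += 1
    return [counts[name] / len(blocks) for name in EDGE_FILTERS]
```

Concatenating the local histograms of the 16 sub-images would give the 80 bins mentioned in the text.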
When EHD is used in texture-based image retrieval and indexing, the Euclidean distance measure can be
applied before quantization. However, the histogram intersection shown below is preferred after
quantization:
$$P(i,j) = \frac{\sum_{k=0}^{L-1} \min\left( H_i(k), H_j(k) \right)}{\sum_{k=0}^{L-1} H_i(k)}, \qquad (6)$$

where H denotes the histogram of an image, and L denotes the total number of bins.
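A minimal sketch of Eq. (6) is given below; the function name is illustrative. Note that the result is a similarity (larger means more alike), and that it is normalized by the first histogram only, so it is not symmetric in general.

```python
def histogram_intersection(h_i, h_j):
    """Histogram intersection of Eq. (6), normalized by the sum of h_i."""
    if len(h_i) != len(h_j):
        raise ValueError("histograms must have the same number of bins")
    total = sum(h_i)
    if total == 0:
        raise ValueError("the reference histogram h_i must not be empty")
    return sum(min(a, b) for a, b in zip(h_i, h_j)) / total
```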

Fig. 2. Definition of sub-image and image-block of EHD. The image is divided into a 4 × 4 grid of sub-images
indexed (0,0) to (3,3); each sub-image is further divided into image-blocks.


3. Analytical discussions

In the previous section, the three texture descriptors were described in detail. An analytical discussion
of the similarities and differences among the three descriptors is given below.
The similarities among HTD, TBD, and EHD are as follows:

(a) HTD, TBD, and EHD are all perceptually meaningful. HTD captures energy features in all directions
and scales in the Gabor transform domain, which represents texture exactly and comprehensively; TBD
captures regularity as well as two dominant directions and scales, which is consistent with human visual
perception; EHD captures edge information after dividing the image into sub-images, which is a statistical
description of texture.
(b) HTD, TBD, and EHD are all application independent. No prior knowledge or information about the par-
ticular type of texture is assumed.
(c) HTD, TBD, and EHD all have constant dimensions. The dimension is invariant once the feature type,
i.e., the texture descriptor, is selected and used.

The differences among HTD, TBD, and EHD are as follows:

(a) Feature domain. HTD is obtained from Gabor transform domain, i.e., energy domain; TBD is obtained
from both Gabor transform domain and spatial domain since the perceptual browsing component is
computed by the filtered image and expressed in the spatial domain; EHD is obtained from spatial
domain.
(b) Feature representation. The representation of HTD and EHD is not very compact. HTD is
62-dimensional and EHD is 80-dimensional, which may make image-to-image matching time-consuming.
The representation of TBD is more compact. Its dimension is only 5, which allows quick browsing.
(c) Feature computation complexity. The computation process of TBD is more complex than that of HTD
since Gabor filtering is just the pre-processing for the computation of TBD. The computation of TBD
needs some extra processes to obtain the regularity as well as dominant directions and scales, as
described in Section 2. The computation process of EHD is the least complex one among these three
descriptors, as it can be directly obtained in spatial domain by edge detection operators. So, TBD is
the most time-consuming in computation.
(d) Type of features captured. HTD only captures global features providing holistic characteristic.
TBD captures both global features and local features. However, it only provides the perceptual
characteristic instead of precise numerical characteristic. EHD only captures local features that
provide particular numerical characteristic.
(e) Parameters or thresholds influence. For HTD, two crucial parameters are total numbers of orientation
and scale of Gabor filters, which are recommended as 6 and 5, respectively, in MPEG-7. For TBD,
besides the same parameter as that of HTD, a threshold for the candidate selection [8] is significantly
influential. For EHD, a threshold is important in determining which types of edge the image-block
belongs to.
(f) Hierarchical representation. HTD supports hierarchical representation including base represen-
tation (32-dimensional vector without the second energy moments) and enhanced representa-
tion (62-dimensional vector), while TBD and EHD do not support hierarchical
representation.
(g) Suitability for efficient indexing. HTD is extracted by Gabor filters, which are successful in
texture representation; it is therefore well suited to texture-based image retrieval and indexing.
TBD is represented according to human perception and is recommended for texture browsing, but it is
poorly suited to image retrieval. EHD is a histogram-based descriptor and can be used either as an
individual indexing tool or integrated with any other histogram.

4. Experimental comparison and evaluation

4.1. Experiment setting

To test the performance of the three descriptors, we implemented a texture-based image retrieval and
indexing system, which mainly consists of feature extraction and image ranking based on a similarity
measure. In feature extraction, the feature vectors for each descriptor are computed and stored in the
database. In image ranking, a distance measure between the query image and each database image is computed,
and all the images in the database are ranked by their distance to the query. The detailed process is
discussed below.
The well-known Brodatz texture image database, which includes 112 images, is used in the experiments. Every
image, of size 640 × 640, is divided into 25 non-overlapping sub-images, which naturally form one class.
Thus, an image database with 2800 images belonging to 112 classes is prepared for the image retrieval
experiments.
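The database construction can be illustrated with a short sketch that cuts a square image into a 5 × 5 grid of non-overlapping tiles, so that a 640 × 640 image yields 25 tiles of 128 × 128. Images are represented here as lists of rows purely for illustration; the function name and the row-major tile order are assumptions of this example.

```python
def tile_image(image, tiles_per_side=5):
    """Split a square image (list of equal-length rows) into non-overlapping square tiles.

    Tiles are returned in row-major order (left to right, then top to bottom).
    """
    size = len(image)
    if size == 0 or size % tiles_per_side or any(len(row) != size for row in image):
        raise ValueError("image must be square with a side divisible by tiles_per_side")
    step = size // tiles_per_side
    tiles = []
    for top in range(0, size, step):
        for left in range(0, size, step):
            tiles.append([row[left:left + step] for row in image[top:top + step]])
    return tiles


# 112 source images x 25 tiles each gives the 2800 database images.
DATABASE_SIZE = 112 * 5 * 5
```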
Common performance measures, i.e., precision and recall, are used as the evaluation criteria. They are
defined as [13]:
$$\text{precision} = \frac{\text{No. relevant images retrieved}}{\text{Total No. images retrieved}}, \qquad (7)$$
$$\text{recall} = \frac{\text{No. relevant images retrieved}}{\text{Total No. relevant images}}. \qquad (8)$$

Every image in the image database is used in turn as a query. Through image ranking, the top N images are
returned according to their similarity measures. Each returned image that belongs to the same class as the
query image counts as one relevant image retrieved. The precision and recall are calculated for each query,
and the precision and recall for the whole database are obtained by averaging over all 2800 queries.
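The evaluation protocol above can be sketched as follows. The sketch takes the class labels of the ranked database images for one query and applies Eqs. (7) and (8) to the top N; whether the query image itself is included in the ranking is not stated in the text and is left to the caller. Function and parameter names are illustrative.

```python
def precision_recall(query_label, ranked_labels, n, class_size):
    """Precision and recall (Eqs. 7 and 8) for the top-n results of one query.

    ranked_labels: class labels of the database images, most similar first.
    class_size: total number of relevant images for this query (assumed known).
    """
    top = ranked_labels[:n]
    if not top or class_size <= 0:
        raise ValueError("need at least one returned image and a positive class size")
    hits = sum(1 for label in top if label == query_label)
    return hits / len(top), hits / class_size


def average_precision_recall(results):
    """Average a list of (precision, recall) pairs over all queries."""
    if not results:
        raise ValueError("no query results to average")
    count = len(results)
    return (sum(p for p, _ in results) / count, sum(r for _, r in results) / count)
```

For example, with N = 20 returned images and 25 images per class, a query with 5 relevant images in the top 20 has a precision of 0.25 and a recall of 0.2.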
The three types of negative influencing factors are considered and investigated in detail here: robustness to
noise, rotation, and compression. Then the computational efficiency and the performance based on different
similarity measures are explored. For robustness, we present experimental results in two parts. First, the same
influences on different texture descriptors are evaluated and compared. The column graphs are used to eval-
uate the performance of the different descriptors under the same influence. The horizontal axis is the degree of
the influencing factor and the precision/recall changes with the influence degree. Second, the different influenc-
ing factors on each descriptor are investigated and compared. The curves of precision–recall are applied to
illustrate the performance of each descriptor. The horizontal axis is the recall and the precision changes with
the increase of recall.

4.2. Robustness

4.2.1. Influence of noise


Noise is often an unavoidable effect for image retrieval and indexing. De-noise filtering can be used, but
some texture information may also be eliminated by filtering. Therefore, whether a descriptor is robust or
not to noise is quite important.
To test the robustness of the descriptors to noise, Gaussian noise with zero mean and several different stan-
dard deviations has been added to each image in the database in our retrieval experiments. Three descriptors
are used as the retrieval features, respectively, and the Euclidean distance measure is used as the similarity
measure. The standard deviation of the noise ranges from 0, which has no effect to images, to 76.5, which
makes the image texture significantly illegible. Some discrete values are implemented in the experiments;
the corresponding noise standard deviations are 6.375, 19.125, 25.5, 38.25, and 76.5, respectively. When the
number of returned images is 20, Fig. 3 illustrates the experimental results, in which the horizontal axis
denotes the noise standard deviation and the vertical axis denotes precision/recall. Fig. 3 (A) presents
the precision and (B) presents the recall.

Fig. 3. Precision and recall comparison with noise. (A) Precision with noise; (B) recall with noise. Bars
for HTD, TBD, and EHD; horizontal axis: noise standard deviation (0, 6.375, 19.125, 25.5, 38.25, 76.5);
vertical axis: precision/recall (%).

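The noise degradation used in this experiment can be sketched as below: zero-mean Gaussian noise of a given standard deviation is added to every pixel. Clipping the result to the range [0, 255] is an assumption of this sketch (the text does not say how out-of-range values are handled), and the function name is illustrative.

```python
import random

# Standard deviations used in the experiment (from the text); 0 means no noise.
NOISE_LEVELS = [0, 6.375, 19.125, 25.5, 38.25, 76.5]


def add_gaussian_noise(image, sigma, rng=None):
    """Return a copy of a grey-level image (list of rows) with zero-mean Gaussian noise.

    Assumption: pixel values lie in [0, 255] and noisy values are clipped to that range.
    """
    rng = rng or random.Random()
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]
```

On a 0 to 255 grey scale, the listed standard deviations correspond to 2.5%, 7.5%, 10%, 15%, and 30% of the full range.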
From Fig. 3, it can be concluded that noise has only a weak influence on the three texture descriptors,
especially on HTD and TBD. For HTD, when the noise standard deviation is small (such as 6.375), the
retrieval precision and recall hardly decrease. The precision and recall then decrease as the noise
standard deviation increases. Even when the noise standard deviation reaches 76.5, at which point the image
textures are significantly degraded, the precision and recall decrease by less than 5%. Although the
precision and recall are quite low for TBD, they also decrease little as the noise standard deviation
increases. HTD and TBD are therefore quite robust to noise and suitable for texture-based image retrieval
and browsing, respectively, in noisy circumstances such as web image retrieval and fabric image retrieval.
A possible reason is that the Gabor filter detects orientations and scales stably. In noisy circumstances,
although the texture patterns are blurred, the energy distribution does not change much, so the dominant
orientations and scales can still be detected correctly. EHD is affected by noise more than HTD and TBD.
Since EHD is extracted in the spatial domain, the spatial distribution and edges change dramatically with
the noise. EHD is therefore not suitable for noisy circumstances.

4.2.2. Influence of rotation


Rotation is another important factor for image retrieval. It is desired that texture description is invariant to
rotation. Thus, it is necessary to investigate whether the three descriptors are robust to rotation or not.
A rotation experiment as in [14] is implemented. The pre-processing includes two steps. First, every image
in the database is rotated by a certain angle. For the 25 images in the same class, the orientations range
from 0° to 360° with different interval angles. The interval angle ranges from 0° to 28.8°, and some
discrete values are used in the experiments: 3.6°, 7.2°, 14.4°, and 28.8°. Second, each rotated image is
cropped and interpolated with the bilinear method so that it has the same size as the original image. Then,
the descriptor could be extracted from the rotated
images. In the image ranking, the Euclidean distance measure is applied. When the number of returned images
is 20, the experimental results are illustrated in Fig. 4, in which the horizontal axis denotes the increase of
degree of interval angle and vertical axis denotes precision/recall. Fig. 4 (A) presents the precision and (B) pre-
sents the recall.
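The rotation pre-processing can be sketched as follows. Two assumptions are made that the text does not spell out: the k-th image of a class is rotated by k times the interval angle, and pixels that fall outside the source image after rotation are set to 0, whereas the experiment crops the rotated image instead. The rotation direction is also a convention of this sketch. Function names are illustrative.

```python
import math


def rotation_angles(interval_deg, count=25):
    """Assumed scheme: the k-th image of a class is rotated by k * interval (mod 360)."""
    return [(k * interval_deg) % 360.0 for k in range(count)]


def rotate_bilinear(image, angle_deg):
    """Rotate a square grey-level image about its centre with bilinear interpolation.

    The output has the same size as the input. Output pixels whose source position
    lies outside the image are left at 0.0 (the experiment crops instead).
    """
    n = len(image)
    c = (n - 1) / 2.0
    cos_a = math.cos(math.radians(angle_deg))
    sin_a = math.sin(math.radians(angle_deg))
    out = [[0.0] * n for _ in range(n)]
    for y in range(n):
        for x in range(n):
            # Inverse mapping: locate the source position of output pixel (x, y).
            sx = cos_a * (x - c) + sin_a * (y - c) + c
            sy = -sin_a * (x - c) + cos_a * (y - c) + c
            if 0 <= sx <= n - 1 and 0 <= sy <= n - 1:
                x0, y0 = int(sx), int(sy)
                x1, y1 = min(x0 + 1, n - 1), min(y0 + 1, n - 1)
                fx, fy = sx - x0, sy - y0
                out[y][x] = ((1 - fx) * (1 - fy) * image[y0][x0]
                             + fx * (1 - fy) * image[y0][x1]
                             + (1 - fx) * fy * image[y1][x0]
                             + fx * fy * image[y1][x1])
    return out
```

With an interval of 14.4°, the 25 angles step evenly from 0° to 345.6°, so the images of one class together cover the full circle.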
From Fig. 4, it can be seen that rotation affects the performance of all three texture descriptors
significantly. For all three, the precision and recall drop sharply once the images are rotated. There is
not much difference in precision/recall as the interval angle ranges from 3.6° to 28.8°. However, the
precision and recall are nearly at their minimum at the 14.4° interval, since at that interval the 25
images of one image class together span 360°. When the interval angle is smaller than 14.4°, the textures
of two adjacent images do not differ from each other as much as at 14.4°. When the interval angle is larger
than 14.4°, some images are rotated back toward the original orientation, leading to a slight increase of
the precision/recall. On the whole, the performance of the three texture descriptors decreases dramatically
even with a small rotation. A possible reason is that a rotated texture pattern is regarded as a different
texture pattern. For HTD, the precision decreases by more than 15% and the recall by more than 10% because
the dominant orientations are shifted; for TBD, the precision/recall is quite low even without rotation,
and it decreases further; for EHD, the precision and recall both decrease by nearly 10%. Although EHD can
express edge information to some extent, it is not as stably rotation-invariant as some shape descriptors
(such as the Fourier descriptor). It can be concluded that the three texture descriptors are not robust to
rotation, HTD in particular.

4.2.3. Influence of compression


As more and more images are stored in JPEG compressed format, it is important to investigate how to
obtain the three descriptors directly from the compressed domain. The DCT is the main compression technique
used in the existing JPEG image compression standard. Images are first divided into 8 × 8 blocks and
decomposed into the DCT domain, where pixel energy is packed into the same number of DCT coefficients.
Subsequently, the quantized DCT coefficients are zigzag-scanned and entropy coded (Huffman codes). As both
DCT quantization and Huffman encoding in JPEG are implemented with simple look-up-table operations, the
major computing cost for compression/decompression lies in the forward or inverse DCT. If the three
descriptors can be extracted directly from the compressed domain, the computing cost required for
decompression can be avoided [15].

Fig. 4. Precision and recall comparison with rotation. (A) Precision with rotation; (B) recall with
rotation. Bars for HTD, TBD, and EHD; horizontal axis: interval angle in degrees (0, 3.6, 7.2, 14.4, 28.8);
vertical axis: precision/recall (%).

Here, we first investigate the influence of DCT compression on the three texture descriptors. In general, a
2-dimensional DCT can be expressed as below:
$$C(u,v) = \alpha(u)\,\alpha(v) \sum_{x=0}^{N-1} \sum_{y=0}^{N-1} f(x,y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2N}\right], \qquad (9)$$

where
$$\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u = 1, 2, \ldots, N-1. \end{cases}$$
Since the DCT compacts energy, only a few DCT coefficients are needed to represent the image in the
compressed domain. In our experiments, the DCT coefficients over a certain threshold are preserved.
Thresholds of 10, 100, and 300 are considered. Experiments on the Brodatz database show that more than 90%
of the coefficients (the larger ones) are preserved when the coefficients over 10 are kept, about 10% are
preserved when the coefficients over 100 are kept, and only 1% are preserved when the coefficients over 300
are kept.
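Eq. (9) and the coefficient thresholding can be illustrated with the direct (unoptimized) sketch below. Comparing the absolute value of each coefficient with the threshold is an assumption, since the text only says that coefficients "over" the threshold are kept; the function names are illustrative.

```python
import math


def dct2(block):
    """Direct evaluation of the 2-D DCT of Eq. (9) for an N x N block (list of rows)."""
    n = len(block)

    def alpha(u):
        return math.sqrt((1 if u == 0 else 2) / n)

    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for x in range(n):
                for y in range(n):
                    total += (block[x][y]
                              * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                              * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            coeffs[u][v] = alpha(u) * alpha(v) * total
    return coeffs


def keep_large(coeffs, threshold):
    """Zero every coefficient whose magnitude does not exceed the threshold.

    Assumption: the comparison uses the absolute value of the coefficient.
    """
    return [[c if abs(c) > threshold else 0.0 for c in row] for row in coeffs]
```

For a constant 8 × 8 block, all of the energy ends up in the single coefficient C(0,0), which is the extreme case of the energy compaction that the thresholding relies on.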
In Eq. (9), when u = 0 and v = 0, the following DC coefficient of an 8 × 8 image-block is obtained:
$$C(0,0) = \frac{1}{8} \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j). \qquad (10)$$

Eq. (10) is related to the average pixel value inside the block. If we ignore all the AC coefficients and
repeat the average of pixel values 8 times along row and column direction respectively, we can reconstruct
an approximated image which has exactly the same size as the original one directly from the DC
coefficients.
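The DC-only reconstruction described above can be sketched as follows: each 8 × 8 block is replaced by its average value, which by Eq. (10) equals C(0,0)/8. The function name and the list-of-rows image representation are assumptions of this example, which works from pixel values rather than from an actual JPEG bit stream.

```python
def dc_approximation(image, block=8):
    """Approximate an image by repeating each block's mean over the whole block.

    The image (list of rows) must have dimensions that are multiples of the block size.
    """
    height, width = len(image), len(image[0])
    if height % block or width % block:
        raise ValueError("image dimensions must be multiples of the block size")
    out = [[0.0] * width for _ in range(height)]
    for top in range(0, height, block):
        for left in range(0, width, block):
            total = sum(image[y][x]
                        for y in range(top, top + block)
                        for x in range(left, left + block))
            dc = total / block      # Eq. (10): C(0,0) = (1/8) * sum of the block (for block = 8)
            mean = dc / block       # average pixel value inside the block
            for y in range(top, top + block):
                for x in range(left, left + block):
                    out[y][x] = mean
    return out
```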
Although it is difficult to extract reliable texture descriptors due to the loss of detail in the
compressed domain, this DC coefficient algorithm provides a basic foundation for content-based image
indexing and retrieval in the compressed domain. For JPEG images, this algorithm only needs to perform the
entropy (Huffman) decoding of the image-blocks, followed by de-zigzagging and de-quantizing of the
compressed data [16], after which the descriptors are extracted from the DCT coefficients without a full
IDCT. From Eq. (10), it can be seen that this type of extraction applies only one multiplication for each
block, while a full IDCT requires 4032 additions and 4096 multiplications for each block [17].
Then the descriptor extraction could be performed on the compressed images when the DCT coefficients
over different thresholds or only DC coefficients are preserved. In the image ranking, the Euclidean distance
measure is also applied. When the number of returned images is 20, the experimental results are illustrated in
Fig. 5, in which the horizontal axis denotes the threshold of preserved DCT coefficients and vertical axis
denotes precision/recall. Fig. 5 (A) presents the precision and (B) presents the recall.
From Fig. 5, it can be seen that HTD and TBD are more robust than EHD to DCT compression. For HTD, the
precision/recall is almost non-decreasing when the coefficient threshold is smaller than 100. For TBD, the
precision/recall decreases little even when only the DC coefficient is preserved. For EHD, the
precision/recall decreases little when the coefficient threshold is smaller than 10; when more coefficients
are discarded, it decreases significantly. On the whole, compression does not affect the three descriptors
significantly because of the energy compaction of the DCT. The more of the larger coefficients are
preserved, the higher the precision/recall remains. The observation that HTD and TBD maintain better
performance than EHD is also due to the Gabor filter: although some information is discarded, the main
pattern is unchanged, so the Gabor filter can still detect the dominant orientations and scales. On the
other hand, the retrieval performance obtained with DC coefficients only shows that extracting descriptors
directly from the DCT compressed domain is possible, though it is quite difficult and unreliable due to the
low resolution of the DC images.

Fig. 5. Precision and recall comparison with compression. (A) Precision with compression; (B) recall with
compression. Bars for HTD, TBD, and EHD; horizontal axis: compression coefficient threshold (0, 10, 100,
300, DC); vertical axis: precision/recall (%).

In the work by Huang et al. [18], an algorithm for retrieving JPEG compressed images based on weighted
texture features is proposed. In this algorithm, all texture features are extracted in the DCT compressed
domain. First, the weight for each query is selected by training, and then a weighted distance measure is
applied in the matching. For the Brodatz texture database, the average precision is significantly higher
than that of the above experiments. This shows that image retrieval in the compressed domain requires
descriptors other than the three proposed in MPEG-7.

4.3. Performance of the three texture descriptors

Different influences degrade the same texture descriptor to different degrees. It is therefore necessary to
compare the performance of each descriptor under different influences, which provides guidance for
selecting the appropriate texture descriptor for a given application.
The set of negative influencing factors from Section 4.2 is investigated again. First, Gaussian noise with
zero mean and a standard deviation of 19.125 is added to all the images, and image retrieval is then
performed with each of the three texture descriptors. Second, all images are rotated, with an angular interval
of 14.4° between two adjacent images in the same image class, and retrieval is performed. Finally, the three
descriptors are extracted directly from images formed by the DC coefficients in the DCT domain, computed
from Eq. (10), and retrieval is performed. The experimental results are shown as precision–recall
curves in Fig. 6, in which (a) shows the performance of HTD, (b) shows the performance of TBD,
and (c) shows the performance of EHD. The horizontal axis denotes recall and the vertical axis denotes
precision.
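The noise and DC-image degradations described above can be sketched as follows. This is only an illustrative sketch: Eq. (10) is not reproduced in this section, so the DC image here is assumed to be the usual one in which each 8×8 block is replaced by its mean (which equals the DCT DC coefficient up to scaling and, in JPEG, a constant level shift). The function names, the block size of 8, and the figure of 25 images per class (chosen so that 25 × 14.4° = 360°) are assumptions for illustration, not details taken from the experiments.

```python
import random

NOISE_SIGMA = 19.125        # standard deviation quoted in the text
ROTATION_STEP_DEG = 14.4    # angular interval quoted in the text
BLOCK = 8                   # assumed JPEG-style DCT block size


def add_gaussian_noise(image, sigma=NOISE_SIGMA, seed=0):
    """Add zero-mean Gaussian noise to a grey image and clip to [0, 255]."""
    rng = random.Random(seed)
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]


def rotation_angles(images_per_class):
    """Rotation angle (degrees) of each image in one class, 14.4 degrees apart."""
    return [round(k * ROTATION_STEP_DEG, 1) for k in range(images_per_class)]


def dc_image(image, block=BLOCK):
    """Replace every block x block tile by its mean value.

    The result has 1/block of the original resolution along each axis,
    which is why descriptors computed from it are unreliable.
    """
    rows, cols = len(image) // block, len(image[0]) // block
    out = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            total = sum(image[r * block + i][c * block + j]
                        for i in range(block) for j in range(block))
            out_row.append(total / (block * block))
        out.append(out_row)
    return out


if __name__ == "__main__":
    test_image = [[(r * 16 + c) % 256 for c in range(16)] for r in range(16)]
    print(len(dc_image(test_image)), "x", len(dc_image(test_image)[0]))
    print(rotation_angles(25)[:3])
```

With an 8×8 block, a 512×512 texture image would shrink to a 64×64 DC image, which gives a sense of how little structure is left for orientation and scale analysis.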
Fig. 6. Precision–recall curves of image retrieval for the original, noise, rotated, and compressed conditions: (A) retrieval using HTD; (B) retrieval using TBD; (C) retrieval using EHD. Horizontal axis: recall; vertical axis: precision.

From Fig. 6, some fundamental properties can be concluded for each descriptor. HTD is quite robust to
noise. However, rotation and compression to only the DC coefficients affect HTD more significantly.
In particular, when the recall is lower than 0.1, the precision drops dramatically compared with retrieval on the
original image data. For TBD, the retrieval precision on the original image data is already quite low. None of the
retrievals under the different influences performs well, although the precision does not decrease much. It
should be emphasized that the curve under compression is ascending on the whole, which indicates that TBD is
quite unsuitable for image retrieval because a small number of returned images cannot provide an acceptable
precision. EHD is sensitive to all the negative influencing factors, of which rotation has the least effect. The
performance of all three descriptors is worst under compression, which confirms the difficulty of extracting
descriptors directly from the compressed domain. In Fig. 6, some precision–recall curves decrease quickly to a minimum
point and then increase a little. This is explained by the ranking of the relevant images. When many relevant
images are not ranked near the top, the precision–recall curve can have a minimum. By contrast, a
curve in which precision decreases monotonically with recall indicates good image retrieval performance.
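The link between ranking and the shape of the precision–recall curve can be illustrated with a small sketch. It assumes the common convention of recording one (recall, precision) point each time a relevant image is encountered in the ranked list; the function names and the toy rankings are hypothetical and are not data from the experiments.

```python
def precision_recall_curve(ranked_relevance, total_relevant):
    """Return (recall, precision) points, one per relevant image retrieved.

    ranked_relevance: list of booleans in rank order, True where the image
    at that rank belongs to the same class as the query.
    total_relevant: number of relevant images in the whole database.
    """
    points = []
    hits = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


def is_monotonically_decreasing(points):
    """True if precision never increases as recall grows."""
    precisions = [precision for _, precision in points]
    return all(a >= b for a, b in zip(precisions, precisions[1:]))


if __name__ == "__main__":
    # Two relevant images at the top, the remaining four only at ranks 10-13.
    late = [True, True] + [False] * 7 + [True] * 4
    # Relevant images concentrated near the top of the list.
    good = [True, True, True, False, True, False, False, True]

    for name, ranking, total in (("late", late, 6), ("good", good, 5)):
        curve = precision_recall_curve(ranking, total)
        print(name, [round(p, 2) for _, p in curve],
              is_monotonically_decreasing(curve))
```

In the first toy ranking the precision falls from 1.0 to 0.3 and then climbs again as the late relevant images arrive, reproducing the dip-then-rise shape; in the second the precision never increases.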

4.4. Computational efficiency

To compare the computational efficiency of the three descriptors, we recorded the computation time for
descriptor extraction and for image ranking by similarity measure on the Brodatz texture database, using a PC
with a 2.0 GHz Pentium IV processor and 512 MB of memory. The program is written in C.

4.4.1. Time consumption for descriptor extraction


To average out the timing fluctuations caused by some parameters in the algorithms and by differences between
images, 2800 descriptor extraction processes covering all the images in the database were run for each descriptor.
The average running time of those processes is given in Table 1.
It can be seen from Table 1 that HTD is the most efficient in descriptor extraction. Although it is not the
shortest descriptor, HTD has the lowest computational complexity among the three descriptors. TBD is the
most time-consuming. In fact, the high-level feature is more complex. So it can be concluded that the more
perceptual the descriptor, the more complex the descriptor extraction is.

4.4.2. Time consumption for ranking


In practical applications, ranking time matters more to users. If the feature data (or descriptors) of all
images are stored in the database of the system, the query time is just the retrieval time. To make the average
retrieval time more precise, we ran 2800 query and image ranking processes covering all the images in
the database. The average running time is given in Table 2.
It can be seen from Table 2 that HTD is the least efficient in image retrieval. From the point of view of the
distance measure, computing the similarity measure for TBD is much faster than for the other two
descriptors since its dimension is only five. Although the computation time for ranking is in general directly
proportional to the dimension of the vectors, HTD is more time-consuming than EHD because HTD consists of
a real part and an imaginary part, which doubles the computational complexity. So HTD is less suitable for
real-time systems even though it has high retrieval performance, whereas TBD is less suitable for retrieval because
of its low performance even though it is computationally efficient. EHD offers a balance between
effectiveness and efficiency.
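A rough sketch of ranking with stored descriptors is given below, to make clear why the ranking cost grows with the descriptor dimension: each query needs one distance evaluation per database image, and each evaluation touches every component. The experiments used a C program; this Python sketch, its function names, and the city block distance chosen here are assumptions for illustration only.

```python
import time


def city_block(a, b):
    """City block (L1) distance; cost is proportional to the dimension."""
    return sum(abs(x - y) for x, y in zip(a, b))


def rank_database(query, database):
    """Return image ids sorted by increasing distance to the query.

    database maps an image id to its precomputed descriptor vector,
    so no descriptor extraction happens at query time.
    """
    return sorted(database,
                  key=lambda image_id: city_block(query, database[image_id]))


def average_ranking_time_ms(database, repeats=1):
    """Use every stored descriptor as a query and average the ranking time."""
    start = time.perf_counter()
    for _ in range(repeats):
        for query in database.values():
            rank_database(query, database)
    elapsed = time.perf_counter() - start
    return 1000.0 * elapsed / (repeats * len(database))


if __name__ == "__main__":
    toy_database = {"a": [0.0, 0.0], "b": [1.0, 1.0], "c": [5.0, 5.0]}
    print(rank_database([0.9, 0.9], toy_database))
    print(average_ranking_time_ms(toy_database, repeats=100))
```

Averaging over every image as a query mirrors the idea of repeating the measurement across the whole database rather than timing a single query.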

Table 1
The average running time of feature extraction
Texture descriptor                                   HTD    TBD    EHD
Average running time of feature extraction (ms)      585    814    664

Table 2
The average running time of ranking
Texture descriptor                                   HTD    TBD    EHD
Average running time of ranking (ms)                 920     10    322

4.5. Different similarity measures

Although the similarity measure is not included in the normative part of the MPEG-7 standard, it is important
in retrieval. Recently, a large number of measure functions based on different principles (from statistics,
psychology, medicine, and the social and economic sciences, among others) have been proposed and implemented
to verify whether they are appropriate for retrieval. The Minkowski distance, which has a specific meaning in
physics and mathematics, is often used as a similarity measure. In this family of distance measures, the two
most familiar are the Euclidean distance and the city block distance. Histogram intersection is another
important similarity measure for low-level features, used especially for histogram features. There are also some
more complex similarity measures that involve perception, recognition, and subjectivity.
Considering the recommendation of MPEG-7, some distance measures are implemented on three texture
descriptors in our experiments. The Mahalanobis distance is not considered because different descriptors would
require different covariance matrices and for some descriptors it is simply impossible to define a covariance
matrix.
When the number of returned images is 20, the corresponding precisions are shown in Table 3.
From the table, it can be seen that although the city block distance (defined in Eq. (2)) and the Euclidean
distance (defined in Eq. (4)) are different, there is no essential difference between their performances. Both of
them are quite effective. For HTD, the weighted city block distance is weighted by the normalized deviation of
the energy moment, which improves the precision significantly at the cost of computation. For TBD, it is very
difficult to find a good matching measure, as mentioned in [19]. For EHD, the Euclidean distance is efficient
enough without quantization. After quantization, histogram intersection is more efficient and convenient
for combined retrieval with other histogram features.
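The measures compared in Table 3 can be sketched with their textbook definitions. Eqs. (2) and (4) are not reproduced in this section, so the forms below are assumed to agree with them. The `scale` argument of the weighted variant is a hypothetical stand-in for the per-component normalizing deviation, in the spirit of the weighting described for HTD. All function names are illustrative.

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def city_block(a, b):
    """Minkowski distance with p = 1."""
    return minkowski(a, b, 1)


def euclidean(a, b):
    """Minkowski distance with p = 2."""
    return minkowski(a, b, 2)


def weighted_city_block(a, b, scale):
    """City block distance with each component divided by a scale factor.

    scale[i] is assumed to be a positive spread measure of component i,
    e.g. its standard deviation over the whole database.
    """
    return sum(abs(x - y) / s for x, y, s in zip(a, b, scale))


def histogram_intersection(h1, h2):
    """Similarity (larger is more similar) between two normalized histograms."""
    return sum(min(x, y) for x, y in zip(h1, h2))


if __name__ == "__main__":
    u, v = [0.0, 0.0], [3.0, 4.0]
    print(city_block(u, v), euclidean(u, v))
    print(weighted_city_block(u, v, scale=[3.0, 2.0]))
    print(histogram_intersection([0.5, 0.5, 0.0], [0.25, 0.25, 0.5]))
```

Note that histogram intersection is a similarity rather than a distance, so a ranking based on it sorts in decreasing order, whereas the other three sort in increasing order.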

5. Applicability

The three texture descriptors have been used widely in many applications. In the following, we summarize
the applicability of the three descriptors on the basis of some typical existing applications. As recommended in
MPEG-7, HTD is suitable for texture-based image retrieval; TBD is suitable for texture-based browsing;
EHD is suitable for texture and shape-based image retrieval. Detailed discussions are given below.

5.1. HTD

(a) Suitable for texture-based image retrieval. This is the general application for HTD because it has much
better performance than that of many other texture descriptors. Some detailed comparison results can be
found in [6].
(b) Used in texture-based segmentation. Since HTD can capture the most salient features of a texture pattern,
different texture patterns in one image can be distinguished by it. For instance, a remote sensing image
can be divided into many regions according to its texture, in which vegetation tends to have distinct tex-
ture while ocean area is almost smooth [6].

Table 3
The precisions of the different measures
Descriptor   City block distance   Weighted city block distance   Euclidean distance   Histogram intersection
HTD          35.42%                50.61%                         30.06%               -
TBD          11.76%                -                              10.90%               -
EHD          16.14%                -                              16.50%               22.83%

5.2. TBD

(a) Texture browsing and indexing. This is the initially proposed application of this descriptor in MPEG-7.
The TBD vector represents the regularity, coarseness (scales) and orientations of a texture pattern, which
correlate more closely with the Human Visual System (HVS). Several examples of texture patterns and
their TBD are illustrated in Fig. 7, in which each texture pattern is given with its regularity value and its
dominant orientations and scales.
(b) Texture classification. Some similar textures belong to the same pattern class in HVS and perceptual
terms, and have the same or similar TBD. So texture patterns can be classified in terms of their TBD. As
recommended by MPEG-7, texture can be classified into four categories, from irregular and slightly regular
to regular and highly regular. In our experiment, a classification of the images in the Brodatz database
according to regularity is shown in Table 4.
(c) Detection of texture structural defects, as in [20]. Humans have a surprising capability to easily find
imperfections in spatial structures, but this capability to perceive local disorder is quite weak in computer
vision. A survey on defect detection in texture is given in [21]. TBD can be used to detect texture structural
defects because its description is consistent with human perception. We define structural defects as
irregularities: regions where the regularity is significantly different from the dominant values of most other
regions. In our defect detection experiment, the image is divided into appropriate blocks in which the
regularity of TBD is computed. After obtaining the TBD vectors of all the blocks, we consider that the
defect lies in the blocks whose regularity values differ significantly from those of most of the other blocks. An
example is shown in Fig. 8.

D75  [3.448000 4 4 2 3]
D46  [4.168000 2 3 3 3]
D102 [5.680000 5 5 4 2]

Fig. 7. TBD expressions of some Brodatz textures.

Table 4
Texture classification for Brodatz database
Regularity value                                     Images in the Brodatz texture database
Regularity > 5 (highly regular)                      D6 D21 D34 D52 D53 D102
4 < Regularity < 5 (regular)                         D2 D3 D4 D5 D10 D14 D17 D18 D20 D22 D24 D25 D28 D33 D35 D36 D37 D38 D39
                                                     D40 D42 D43 D44 D46 D47 D50 D51 D54 D55 D56 D57 D60 D64 D65 D66 D67 D69
                                                     D73 D76 D80 D82 D84 D85 D86 D87 D93 D95 D97 D99 D101 D103 D104 D106
                                                     D107 D111
3 < Regularity < 4 (slightly regular or irregular)   D1 D7 D8 D9 D11 D12 D13 D15 D16 D19 D23 D26 D27 D29 D30 D31 D32 D41 D45
                                                     D48 D49 D58 D59 D61 D62 D63 D68 D70 D71 D72 D74 D75 D77 D78 D79 D81 D83
                                                     D88 D89 D90 D91 D92 D94 D96 D98 D100 D105 D108 D109 D110 D112

Fig. 8. Result of defect detection (D35).
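The regularity-based classification of Table 4 and the block-wise defect detection can be sketched as follows. Several details are assumptions made only for illustration: values lying exactly on a class boundary are assigned to the upper class, values of 3 or below fall into the lowest class, the dominant regularity is approximated by the median over all blocks, and a block is flagged when it deviates from that median by more than a hypothetical threshold. Computing the regularity of a block is not shown; the values are taken as given.

```python
from statistics import median


def regularity_class(regularity):
    """Map a TBD regularity value to the class labels used in Table 4."""
    if regularity >= 5:
        return "highly regular"
    if regularity >= 4:
        return "regular"
    return "slightly regular or irregular"


def defect_blocks(block_regularity, threshold=1.0):
    """Return the positions of blocks whose regularity is an outlier.

    block_regularity maps a block position (row, col) to the regularity
    component of that block's TBD vector. The median stands in for the
    dominant value of most blocks; threshold is a hypothetical parameter.
    """
    dominant = median(block_regularity.values())
    return sorted(position
                  for position, value in block_regularity.items()
                  if abs(value - dominant) > threshold)


if __name__ == "__main__":
    # Regularity values of the three textures listed for Fig. 7.
    for name, value in (("D75", 3.448), ("D46", 4.168), ("D102", 5.680)):
        print(name, regularity_class(value))

    # Toy 2 x 2 grid of block regularities with one irregular block.
    blocks = {(0, 0): 4.2, (0, 1): 4.1, (1, 0): 4.3, (1, 1): 2.0}
    print(defect_blocks(blocks))
```

Applied to the three textures listed for Fig. 7, the classifier places D75, D46, and D102 in the same rows as Table 4 does.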

5.3. EHD

(a) Suitable for texture and shape-based as well as color and texture-based image retrieval. EHD presents
local edge features while some shape descriptors (such as the Fourier Descriptor) capture global features,
hence combining them can describe the edge of the object more precisely. Color is
another important low-level image feature, often described by a color histogram. Therefore, color and
texture histograms can conveniently be combined and integrated to implement more effective image
retrieval.
(b) Used for a type of coarse representation of object edge. EHD consists of one non-directional and four
directional (vertical, horizontal, 45°, and 135°) edges, thus it can be considered as a local representation
of an object. Although it cannot be applied independently, it is a good supplement to object
representation. An enhanced EHD is introduced in [9], in which global information is included and better
retrieval performance is achieved.

6. Conclusions

In this paper, we propose a comprehensive evaluation and comparison benchmark for feature descriptors,
especially for visual descriptors in MPEG-7. In this benchmark, three texture descriptors, HTD, TBD, and
EHD defined in MPEG-7, are evaluated and compared. Through experiments on the Brodatz texture image
database, the robustness, computational efficiency, and similarity measures of the three descriptors are investigated.
HTD is the most robust to noise. The performance of EHD is weakened by all types of negative
influencing factors. Generally, it is difficult to make the three descriptors effective in the compressed domain. For
computational efficiency, HTD is less time-consuming than the other two descriptors in feature extraction,
while TBD is much less time-consuming than the other two descriptors in image ranking. The Euclidean
distance and the city block distance are both effective for these three texture descriptors.
In the future, the combination of the descriptors will be studied to handle complex texture images. The
combination of these texture descriptors with other content descriptors is also planned.

Acknowledgments

This work has been supported by the Grants SRFDP-20050003013 and NJUPT-K02089.

References

[1] Y.M. Ro, M. Kim, H.K. Kang, B.S. Manjunath, J. Kim, MPEG-7 homogeneous texture descriptor, ETRI J. 23 (2) (2001) 41–51.
[2] ISO/IEC JTC1/SC29/WG11, MPEG-7 overview, V. 8, Doc. N4980, 2002.
[3] ISO/IEC JTC1/SC29/WG11, MPEG-7 overview, V. 9, Doc. N5525, 2003.
[4] M. Grgic, M. Ghanbari, S. Grgic, in: Texture-based Image Retrieval in MPEG-7 Multimedia System, EUROCON2001, vol. 2, 2001,
365–368.
[5] P. Kruizinga, N. Petkov, S.E. Grigorescu, Comparison of texture features based on Gabor filters, in: Proceedings of the 10th
International Conference on Image Analysis and Processing, Venice, Italy, September 27-29, 1999, pp. 142–147.
[6] B.S. Manjunath, W.Y. Ma, Texture features for browsing and retrieval of image data, IEEE Trans. Pattern Anal. Mach. Intell. 18 (8)
(1996) 837–842.
[7] B.S. Manjunath, J.R. Ohm, V.V. Vasudevan, A. Yamada, Color and texture descriptors, IEEE Trans. Circ. Syst. Vid. Technol. 11 (6)
(2001) 703–715.
[8] P. Wu, B.S. Manjunath, S. Newsam, H.D. Shin, A texture descriptor for browsing and similarity retrieval, Signal Process. Image
Commun. 16 (2000) 33–43.
[9] C.S. Won, D.K. Park, S.J. Park, Efficient use of MPEG-7 edge histogram descriptor, ETRI J. 24 (1) (2002) 23–30.
[10] Y.J. Zhang, Content-Based Visual Information Retrieval (in Chinese), Science Publisher, Beijing, 2003.

[11] D.S. Zhang, G.J. Lu, A comparative study of curvature scale space and Fourier descriptors for shape-based image retrieval, J. Vis.
Commun. Image Representation 14 (2003) 41–60.
[12] P. Brodatz, Texture: A Photographic Album for Artists and Designers, Dover, New York, 1996.
[13] H. Müller, W. Müller, D.M. Squire, S.M. Maillet, T. Pun, Performance evaluation in content-based image retrieval: overview and
proposals, Pattern Recognit. Lett. 22 (2001) 593–601.
[14] C.M. Pun, Rotation-invariant texture feature for image retrieval, Comput. Vis. Image Understanding 89 (2003) 24–43.
[15] H. Wang, A. Divakaran, A. Vetro, S.F. Chang, H. Sun, Survey of compressed-domain features used in audio–visual indexing and
analysis, J. Vis. Commun. Image Representation 14 (2003) 150–183.
[16] A.R. McIntyre, M.I. Heywood, in: Exploring Content-based Image Indexing Techniques in the Compressed Domain, Proceedings of
the 2002 IEEE Canadian Conference on Electrical and Computer Engineering, vol. 2, 2002, 957–962.
[17] J. Jiang, A. Armstrong, G.C. Feng, Direct content access and extraction from JPEG compressed images, Pattern Recognit. 35 (2002)
2511–2519.
[18] X.Y. Huang, Y.J. Zhang, D. Hu, Image Retrieval Based on Weighted Texture Features Using DCT Coefficients of JPEG images,
ICICS-PCM 2003, 3B2.2 P0367 (1–5).
[19] H. Eidenberger, Distance measures for MPEG-7-based retrieval, Technical Report TR-188-2-2003-20, Vienna University of
Technology, Austria, 2003.
[20] D. Chetverikov, Pattern regularity as a visual key, Image Vis. Comput. 18 (2000) 975–985.
[21] K.Y. Song, M. Petrou, J. Kittler, Texture defect detection: a review, SPIE: Applications of Artificial Intelligence X: Machine
Vision and Robotics 1708 (SPIE) (1992) 99–106.
