VISION
by
Andre V. Harrison
Baltimore, Maryland
December, 2012
Dissertation Publishing
UMI 3572710
Published by ProQuest LLC 2013. Copyright in the Dissertation held by the Author.
Microform Edition ProQuest LLC.
All rights reserved. This work is protected against
unauthorized copying under Title 17, United States Code.
ProQuest LLC
789 East Eisenhower Parkway
P.O. Box 1346
Ann Arbor, MI 48106-1346
Abstract
From night vision goggles, to infrared imagers, to remote controlled bomb disposal robots, we are increasingly employing electronic vision sensors to extend or overcome the limitations of our own visual sensory system. And while we can make these systems better in terms of the amount of power they use, how much information they capture, or how much information they can send to the viewer, it is also important to keep in mind the capabilities of the human who must receive this visual information from the sensor and display system.

The ideal match between our own visual sensory system and that of the electronic image sensor and display system is one where the least amount of visual information is sent to our own sensory system for processing, yet contains all the visual information that we need to understand the desired environment and to make decisions based on that information. In order to do this it is important to understand both the physiology of the visual sensory system and the psychophysics of how this information is used. We demonstrate this idea by researching and designing the components needed to optimize the compression of dynamic range information onto a display, for the sake of maximizing the amount of perceivable visual information shown to the human visual system.
An idea that is repeated in the construction of our optimal system is the link
between designing, modeling, and validating both the design and the model through
human testing. Compressing the dynamic range of images while trying to maximize the amount of visual information shown is a unique approach to dynamic range compression. So the first component needed to develop our optimal compression method is a way to measure the amount of visual information present in a compressed image. We achieve this by designing an Information Content Quality Assessment metric and we validate the metric using data from our psychophysical experiments [in preparation]. Our psychophysical experiments compare different dynamic range compression methods in terms of the amount of information that is visible after compression [1].
Our quality assessment metric is unique in that it models visual perception using information theory rather than biology. To validate this approach, we extend our model so that it can predict human visual fixation. We compare the predictions of our model against human fixation data and the fixation predictions of similar state-of-the-art fixation models. We show that the predictions of our model are comparable to or better than the predictions of these fixation models [2]. We also present preliminary results on applying the saliency model to identify potentially salient objects in out-of-focus locations due to a finite depth-of-field [in preparation].
Acknowledgments
Firstly I would like to thank my advisor Professor Ralph Etienne-Cummings for
taking me into his lab. Ralph has always been patient with my progress and willing to
provide assistance when I asked for it. I wish to thank Professor Andreas Andreou for
being on my seminar and dissertation committee and for reading my thesis. I would
also like to thank Professor Mounya Elhilali for reading my thesis and for being on my dissertation committee. I thank Professor Jacob Khurgin for being on my seminar committee and I thank Professor Sanjeev Khudanpur for his helpful discussions on how to make the math in my model actually implementable.
Secondly I would like to acknowledge everyone, past and current, from the CSMS lab and Dr. Andreou's lab; Garrick, Fope, Clyde, Francesco, Mike, Kerron, and ZZ, for their fellowship and making the two labs feel like one big community. I would like to give special thanks to R. Jacob Vogelstein for showing me the ropes when I first joined the lab. I thank Phillipe Pouliquen for fulfilling his role of always being a knowledgeable and dependable person to come to when I was confused or stuck
on pretty much any topic. I thank Kevin Mazurek for being someone whom I could bounce ideas off of and with whom I could do sanity checks. He is someone I have come to count on whenever I need help, like when I need to submit my thesis. I thank Alex Russell for his collaboration on the human fixation model. I would also like to thank Xiaoyu Guo for being willing to take my lab IT duties, though he managed to dodge that bullet, ND Ekekwe for being the lab orator and inspiring interesting debates amongst the lab, Mike Chi for showing me the nuts and bolts of getting things done in the lab, Srinjoy Mitra for showing me better ways to collaborate on a Cadence design, Frank Tejada for his guidance in designing photodetectors and pads in the 3D process, Joe Lin for donating his AER communication blocks to my imager chip and for his help in testing my imager chips, Recep Ozgun for collaborating on the photodetector designs for the 3D imager and for his friendship, Gozen Koklu for somehow getting me to socialize, Tom Murray for donating his Xbox for Halo night, and Jie Zhang for actually taking over the lab IT duties.
I would also like to acknowledge my sponsors at the Army Research Lab and everyone I've met at the Army Research Lab; Dr. Grayson CuQlock-Knopp, Dr. Adrienne Raglin, and Samantha Wallace, for helpful collaboration on my different projects. I would like to especially acknowledge Dr. Linda Mullins for her psychological expertise in setting up, conducting, and analyzing the results of the psychophysical study. I would like to acknowledge David Kregel for his help in collecting and tabulating the psychophysical data for analysis.
I would also like to acknowledge Mark Kregel for his help in creating one of the high
dynamic range images used in the psychophysical study. I also need to acknowledge
Timo Kunkel from the Graphics Group at the University of Bristol; Laurence Meylan from the Munsell Color Science Laboratory Database; Rafal Mantiuk from the Max-Planck-Institut für Informatik; and Erik Reinhard, Greg Ward, Sumanta Pattanaik, and Paul Debevec from [5] for providing the source images used in our psychophysical
study.
Finally I would like to thank my parents and extended family for everything they have done, big and small, for me.
Dedication

This thesis is dedicated to my parents and family for their steady confidence when I wasn't so sure...
Contents

Abstract
List of Tables

1 Introduction
    1.1 Approach
    1.2 My Contribution
    1.3 Thesis Outline

2 Human Visual Perception Limits
    2.1 Introduction
    2.2 Physiology of the Eye
    2.4 Perceptual Limits
    Summary

3
    3.1 Introduction
    3.3.2.1 Retinex
    3.3.2.3 Bilateral Filter
    3.4.2 Study I
        3.4.2.1 Methods
        3.4.2.2 Procedure
        3.4.2.3 Results
    3.4.3.1 Procedure
    3.4.4 Results
    3.5 Conclusion
    3.6 Discussion

4
    4.2.1 Model Types
        4.2.1.1 Bottom-Up Models
        4.2.1.2 Top-Down Models
    4.3.1 Model Background
        4.3.1.2 Wavelet Decomposition
    4.3.2.1 Image Decomposition
    4.3.3.1 Initial Tests
    4.3.3.2 Looking Forward

5
    Evaluation
    5.4.3 Metrics
    5.4.4 Center Bias
    5.5 Results
    5.6 Discussion

6
    6.1.1.1 The Pixel
    6.1.1.2 Design Layout
    6.1.3 Measurement Error
    Future Work

7 Conclusions
    7.4 Summary

Bibliography

Vita
List of Tables

3.1 Paired comparison results and p-values collapsed across participants and scenes.
3.2 Paired comparison results and p-values collapsed across participants and scenes from the second study.
4.1 Quality assessment entropy measure of images used in the second psychophysical experiment.
5.1 Quantitative comparison of 3 models and a Gaussian kernel on the MIT dataset using different similarity measures.
5.2 The results of applying significance tests (p) on the results of the metrics used in Table 5.1.
6.1 Summary of the imager properties.
Chapter 1

Introduction
Vision is the primary sense that people use to gather information about the surrounding environment, but in many situations our eyes are unable to capture information from a desired environment. This may be because the illumination level is too low, the spectral range is beyond what our eyes can perceive, or the environment is too dangerous for a person to be in themselves. In these instances electronic image sensor and display systems, such as night vision goggles (NVGs), infrared imagers, and tele-operation systems, have enabled us to capture information about the environment that our eyes could not.

Figure 1.1: Diagram of Head Mounted Vision System. Image from [6].
In a sense these systems act as artificial eyes for the user, and when these systems are used in dangerous or unstable situations, such as in conflict zones or collapsed structures, it is important to the person using them that they gather as much information about their environment as quickly as possible. It is therefore important that the system captures and presents the optimal amount of perceivable information to the user, while still being efficient and useful. In terms of acquiring as much visual information about the environment as possible, the optimal sensor and display system is one that does two things. First, the system must maximize the transfer of visual information from the display to the user, in the form of displaying as much visual information as the user can perceive. Secondly, it must maximize the transmission of information from the image sensor to the display, thereby capturing no more information on the sensor than what the display will show. The optimal system then is a system whose capture and display capabilities are matched to the perceptual limits of the average person. In order to design such an optimal system it is first necessary to understand what the perceptual limits of human vision are.
Another constraint in the design of an optimal system is that in order to create an optimal sensor/display system the capabilities of sensors and displays must meet or exceed those of human visual perception. Sensor and display systems have been designed with certain capabilities that exceed those of human perception, like frame rates of >10,000 fps [14,15], dynamic ranges >160 dB [16-18], and spatial resolutions above what is perceivable by the human retina (at the typical usage distance) [19]. However, several parameters of human perception cannot currently be matched by the current level of technology, like the dynamic range of the display. Between the sensor and the display, most of the limitations in the capabilities of current systems come from the display.
The largest mismatch between the capabilities of the display and the perceptual limits of vision is between the dynamic range of displays versus that of perception. But even when the capabilities of the sensor/display system fall short of the most optimal design, due to manufacturing limitations, cost, the constraints of the application, or for some other reason, the system can still be optimized (within the constraints of the design) to maximize the amount of visual information perceived by the user. In this thesis we demonstrate this by presenting research geared towards maximizing information transfer when the dynamic range of the display is lower than the dynamic range of the sensor and visual perception.
A display's dynamic range, or contrast ratio, refers to the ratio of the brightest and darkest illumination levels that a display can show, while the dynamic range of the sensor or the dynamic range of perception is the ratio between the brightest and dimmest illumination levels that can be detected, or perceived, respectively. We must also specify that we are concerned only with intra-scene dynamic range, which is the ratio of the darkest and brightest light levels that can be shown or perceived at the same time. Most displays used in mobile systems have an intra-scene dynamic range that spans only 2-3 orders of magnitude, while the intra-scene dynamic range of perception can reach up to 5 orders of magnitude. So if an image with a dynamic range higher than 3 orders of magnitude (a high dynamic range image) is to be shown, only parts of the image will be visible. Either the darkest parts of the image will show up well, leaving the brightest sections washed out, or the brightest sections will be shown clearly while the darkest parts appear completely black.
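For concreteness, intra-scene dynamic range can be expressed in orders of magnitude as the base-10 logarithm of the brightest-to-darkest luminance ratio. A small sketch (the luminance values are hypothetical, purely for illustration):

```python
import numpy as np

def dynamic_range_orders(luminance):
    """Intra-scene dynamic range in orders of magnitude: log10 of
    the ratio of the brightest to the darkest non-zero luminance
    present at the same time."""
    lum = np.asarray(luminance, dtype=float)
    lum = lum[lum > 0]  # the ratio is undefined for zero luminance
    return float(np.log10(lum.max() / lum.min()))

# A scene spanning 0.01 to 1000 cd/m^2 covers 5 orders of magnitude --
# more than the 2-3 orders a typical mobile display can show at once.
print(dynamic_range_orders([0.01, 1.0, 1000.0]))  # -> 5.0
```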
However, very little information about the environment is contained in the absolute illumination levels. The spatial and temporal structure of the environment that we use to understand our environment is contained in the relative changes in the illumination level. The human visual system (HVS) has evolved to detect these relative changes in intensity. As a result the HVS is not very sensitive to absolute intensity levels; it is more sensitive to contrast and patterns in the spatial and temporal frequency of the perceivable environment. So by removing the requirement to preserve the absolute intensity levels, it is possible to compress the dynamic range of the visual information captured by the sensor to fit onto the display so that most of the visual information remains perceivable. Methods to do this are known as high dynamic range (HDR) compression methods or tone mapping algorithms [5].
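The simplest way to see why this works is a global logarithmic operator: in the log domain, equal intensity ratios become equal differences, so relative changes survive compression while absolute levels are discarded. This is only a minimal sketch of the principle, not one of the tone mapping algorithms evaluated in this thesis:

```python
import numpy as np

def log_tonemap(hdr_luminance):
    """Global logarithmic tone mapping: squeeze an HDR luminance map
    into display range [0, 1]. Equal luminance ratios map to equal
    display-value differences, preserving relative changes."""
    lum = np.asarray(hdr_luminance, dtype=float)
    log_lum = np.log10(np.maximum(lum, 1e-8))  # guard against zeros
    lo, hi = log_lum.min(), log_lum.max()
    if hi == lo:
        return np.zeros_like(log_lum)
    return (log_lum - lo) / (hi - lo)

hdr = np.array([[0.01, 1.0], [100.0, 10000.0]])  # 6 orders of magnitude
ldr = log_tonemap(hdr)
print(ldr.min(), ldr.max())  # -> 0.0 1.0
```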
Numerous tone mapping algorithms have been developed over the years to compress high dynamic range images (HDRIs) [5,20-24]. They vary in approach, complexity, and the different types of images they work well on, but they have all been evaluated in terms of either how aesthetically pleasing the compressed image appears or how perceptually similar the compressed image is in relation to the original HDRI [25,26]. In this thesis we present research investigating how much visual information is present in tone mapped images. This is an essential step towards the goal of maximizing the information presented to the user when the display has a low dynamic range. We approach this by trying to measure perceived visual information shown on a display using psychophysical experiments and visual modeling.
1.1 Approach

(Figure: the design loop linking Design, Modeling, and Validation through human testing.)
The aim of the research presented in this thesis is to maximize the amount of visual information shown on a low dynamic range display through tone mapping methods. The approach taken in this thesis has 3 research elements in it that are important components towards this goal. The elements are the design of the compression method; testing the design using psychophysical experiments, since the system is intended for human use; and modeling of the testing procedure to provide quantitative measurements of how well the design performs and to speed up the validation of the design. Under this approach we have conducted the following research projects.
- We conducted two psychophysical experiments to evaluate the amount of information that is present after using different tone mapping algorithms.
- We designed and tested an information content quality assessment model to estimate the amount of information present after an image has been compressed and to compare the amount of information present between images.
- As a test to show that the quality assessment model was a valid model of human vision, we extended the model to try and predict human fixation.
- We have designed a wide dynamic range spike based imager chip and we have fabricated it in a 3D 3-tier fabrication process.
- We have also explored how to predict human fixation in finite depth-of-field images.
Focusing on information content over aesthetic quality or perceptual similarity is a new way of evaluating tone mapping methods. So as a first step it is important to know how different compression techniques ranked along this new metric. We can then potentially use this information to improve or tune existing tone mapping methods to show more information.
Human testing is time consuming, can be expensive, and for all the effort put forth, the tests may only give qualitative results, if any useful results are gained at all. We have therefore designed a model to estimate the amount of visual information that is contained in a tone mapped image. Rather than taking a biological approach by trying to model the different biological circuits that make up the visual pathway to measure what details can be extracted from an image, our model is based in information theory and uses an information theoretic concept of information, namely entropy, to estimate the amount of detail in a tone mapped image.
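The information-theoretic quantity underlying this is Shannon entropy. As a toy sketch, the entropy of an image's gray-level histogram already behaves the way such a metric needs: a flat image carries no detail, a richly varying one carries more. (The thesis's actual metric also incorporates natural scene statistics; only the entropy term is sketched here.)

```python
import numpy as np

def histogram_entropy(image, bins=256):
    """Shannon entropy (in bits) of an image's gray-level histogram."""
    counts, _ = np.histogram(np.asarray(image).ravel(), bins=bins)
    p = counts / counts.sum()
    p = p[p > 0]  # 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

flat = np.zeros((8, 8))             # one gray level: no detail
ramp = np.arange(64).reshape(8, 8)  # 64 distinct gray levels
print(histogram_entropy(flat))  # -> 0.0
print(histogram_entropy(ramp))  # -> 6.0 (log2 of 64 equally likely levels)
```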
Models that can give the same judgments as human observers can address all of these problems, but the models themselves must also be validated to show that they are giving the same judgments as human observers. As a further test of the validity of using information theory to model vision, we extend our model to predict human fixation. To see an object at the highest resolution we must look at it, fixate on it, using the central region of our eyes, the fovea. We demonstrate that our approach is a valid way of modeling vision by comparing the fixation predictions of our fixation model to actual human fixation data and to other fixation models of a similar type.
When images are taken with a finite depth-of-field (DOF), objects at a particular range of depths are in focus, while objects outside of that depth are unfocused. Usually this is done on purpose in order to direct the user's attention to a particular object or location. But in mobile sensor/display systems this may not be the case. If these systems have a particular focal depth, because that is the depth of the object or location they are currently fixated on, there may be something that is important to the viewer, but is now less salient because it is out of focus. We present preliminary work on our use of our ideal observer model of fixation to see if we can identify potentially salient or interesting locations even though they are out of focus due to a finite depth-of-field.
1.2 My Contribution
Through the research projects described in the previous section we have made the
following contributions.
- A simple extension of the bilateral filtering tone mapping method [23] that improved the amount of visual information present after compression, and did so in an automated way for any type of natural image. We confirmed the improved performance of the extension through the second of the two psychophysical experiments we conducted [1].
- We have created a no-reference information content based quality assessment model that uses information theory and natural scene statistics to estimate the amount of information that is contained in an image [preliminary results]. The model also identifies the automated bilateral filtering algorithm as the algorithm with the most information, matching the previously collected human data.
- We have extended our quality assessment model to function as an ideal observer model of human fixation. The model uses only bottom-up information and utilizes no training or learning to predict human fixation [2].
- In testing the performance of the ideal observer model we extended the normalized scanpath saliency metric [27] to allow it to compensate for the inherent center bias in human fixation data [2].
- We have designed an encoding method that should allow the wide dynamic range imager to have an increased dynamic range by reducing the errors in read out [4].
- We have extended our ideal observer model to handle the loss of focus due to finite depth-of-field [preliminary results].
We evaluated different tone mapping methods using psychophysical experiments to compare how much visual information can be perceived after each method was applied to a set of images. From these experiments we made a simple modification to improve the performance of one of the tested methods, the bilateral filtering algorithm [23]. The modification improved the visibility of detail in images in an automated fashion.
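The bilateral filtering method [23] that the modification builds on splits log luminance into a smooth base layer and a detail layer, compresses only the base, and recombines, so local detail contrast survives. The sketch below follows that general framework with a naive O(n²) bilateral filter; the parameter values are illustrative, and the thesis's automated parameter selection is not reproduced here:

```python
import numpy as np

def bilateral_filter(img, sigma_s=2.0, sigma_r=0.4, radius=4):
    """Naive bilateral filter: edge-preserving smoothing that weights
    neighbors by both spatial and intensity distance."""
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            patch = img[y0:y1, x0:x1]
            yy, xx = np.mgrid[y0:y1, x0:x1]
            w_s = np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2 * sigma_s ** 2))
            w_r = np.exp(-((patch - img[y, x]) ** 2) / (2 * sigma_r ** 2))
            weights = w_s * w_r
            out[y, x] = (weights * patch).sum() / weights.sum()
    return out

def bilateral_tonemap(lum, compression=0.5):
    """Durand-style framework [23]: split log luminance into a base
    layer (bilateral-filtered) and a detail layer, compress only the
    base, then recombine -- local detail contrast is preserved."""
    log_lum = np.log10(np.maximum(np.asarray(lum, dtype=float), 1e-8))
    base = bilateral_filter(log_lum)
    detail = log_lum - base
    out_log = compression * (base - base.max()) + detail
    return 10.0 ** out_log  # brightest base value maps to ~1
```

Because only the base layer is scaled, the detail layer's contrast, which carries most of the visible structure, passes through unchanged.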
To model how people perceive detail in images we developed a no-reference quality assessment metric that uses natural scene statistics to measure the amount of information in an image by estimating the amount of entropy in that image. We then demonstrate how the quantitative results of the model compare with the qualitative results from the psychophysical experiments.
In the third project we developed an ideal observer model, which is simply an extension of the information content quality assessment model, but tries to predict human fixation by assuming people look at the locations in an image that have the most visual information. Using this model we compare its fixation predictions against the fixation data from recorded human experiments [28]. We also compare our model's fixation predictions against the predictions by other fixation models, namely the Itti model and the graph based visual saliency model (GBVS) [29,30].
However, there is no single agreed upon method to compare fixation predictions and actual fixation data. Numerous metrics have been proposed to do this, and we extend one of these metrics, the normalized scanpath saliency metric, so that it can compensate for center bias [27]. Center bias is the tendency of humans to fixate on the center of an image. Without compensating for this behavior, a fixation metric will give overly optimistic performance results.
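The uncompensated normalized scanpath saliency [27] can be sketched as follows: z-score the saliency map, then average its values at the human fixation locations, so 0 is chance level. The center-bias compensation itself is the thesis's extension [2] and is not reproduced here; fixations are (row, column) pairs:

```python
import numpy as np

def nss(saliency_map, fixations):
    """Normalized scanpath saliency: z-score the map, then average
    its values at fixation locations. 0 = chance; higher is better."""
    s = np.asarray(saliency_map, dtype=float)
    z = (s - s.mean()) / s.std()
    return float(np.mean([z[r, c] for (r, c) in fixations]))

sal = np.zeros((5, 5))
sal[2, 2] = 1.0            # a single salient peak
print(nss(sal, [(2, 2)]))  # fixation on the peak: well above 0
print(nss(sal, [(0, 0)]))  # fixation off the peak: slightly below 0
```

A model that always predicts the image center scores well on raw NSS simply because human fixations cluster centrally, which is why an uncompensated metric is overly optimistic.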
For the wide dynamic range spike based imager we present the design of that chip, and we have proposed a mixed-mode read out scheme that will allow lower error measurements of illumination level and/or a reduction in the power consumed by spike based imager designs. The mixed-mode readout uses both time-to-first-spike (TTFS) and spike rate information [16,31-34] to estimate the illumination for each pixel, which lessens the maximum spike rate and timing clock speeds required for a given level of accuracy. We present both theoretical and SPICE analysis of the benefits of using the mixed-mode read out method.
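The two readout quantities encode illumination in complementary ways: in a TTFS pixel the photocurrent charges a capacitor to a fixed threshold, so brighter pixels fire sooner and illumination goes as 1/t, while a rate readout counts spikes in a window. The constants and function names below are hypothetical, for illustration only; the exact combination rule is the thesis's design and is not reproduced here:

```python
def illum_from_ttfs(t_first_spike, k_ttfs=1.0):
    """Time-to-first-spike readout: illumination ~ k / t, since a
    brighter pixel reaches the spiking threshold sooner.
    (k_ttfs is a hypothetical calibration constant.)"""
    return k_ttfs / t_first_spike

def illum_from_rate(spike_count, window, k_rate=1.0):
    """Spike-rate readout: illumination ~ k * (spikes / window).
    (k_rate is a hypothetical calibration constant.)"""
    return k_rate * spike_count / window

# A pixel twice as bright fires its first spike in half the time and
# spikes twice as often in the same readout window.
print(illum_from_ttfs(0.5), illum_from_ttfs(0.25))         # -> 2.0 4.0
print(illum_from_rate(10, 1.0), illum_from_rate(20, 1.0))  # -> 10.0 20.0
```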
Most models of fixation assume the input image has an infinite depth-of-field, or they simply use the loss of focus due to a finite depth-of-field to modulate their saliency maps [35]. However, the loss of focus due to a finite depth-of-field does not change the fundamental amount of information a location or object may give to the viewer about the environment if they were to simply focus on it. Thus we extend our ideal observer model to try to predict where people should look and what they should focus on in order to gain the maximum amount of information from a scene. We do this by compensating for a reduction in salience due to loss of focus in finite depth-of-field images. We present preliminary work on our approach to this problem.
1.3 Thesis Outline
In chapter 5 we present the extension of the quality assessment model to the task of predicting fixation. We begin the chapter with some background on other fixation models. We describe how the model was extended in order to predict fixation, and we end the chapter by comparing the performance of our model's predictions against actual human fixation data. We also compare our model's predictions against the predictions of other fixation models.
In chapter 6 we present the design of our wide dynamic range spike based imager.
We also present our proposed mixed-mode readout method and the theoretical and
simulation results that show the potential benefits of the approach.
In chapter 7 we conclude this thesis with a summary of the thesis. We also present the preliminary work on extending the ideal observer model to compensate for loss of focus due to finite depth-of-field. We end by discussing the future work of using our model and psychophysical data to potentially find and identify the optimal way to tone map an image in order to maximize the amount of visible information.
Chapter 2

Human Visual Perception Limits

2.1 Introduction
2.2 Physiology of the Eye
In order for an object to be seen, light must first bounce off that object and pass into at least one of our two eyes. The average adult human eye is approximately spherical with a radius of 12.5 mm. The eyes have an average separation between 50 mm and 75 mm. The field of view of each eye, the region of the surrounding environment that each eye can see at any given moment in time, is largely defined by the position of the eyes and the shape of a person's face. Thus it varies among individuals, but it is approximately 95° out from the face, 75° down, 60° towards the nose, and 60° up. These numbers use the optical axis of the eye as the center [36].
When light enters the eye it must pass through the cornea, iris, and lens before it falls upon the retina at the back of the eye. The cornea is a transparent part of the eye covering the iris and pupil. Light that enters the eye must pass through the cornea first before hitting any other part of the eye. The cornea refracts the light that passes through it, focusing it onto the retina. Though its focal depth is fixed, the cornea is still responsible for two-thirds of the eye's focusing power.

Figure 2.1: Perimeter chart showing the average visual field of the right eye; numbers on the perimeter indicate degrees of arc. Adapted from [7].
The pupil is an opening in the iris through which light passes before hitting the lens and retina. The size of the pupil can vary from 2 mm in the brightest light conditions to 8 mm at full dilation. It is controlled by the iris to limit the amount of light that hits the retina. It is one element of the eye's ability to adapt to changing light levels and is capable of reducing the amount of light hitting the retina to 1/16th its original value, relative to full dilation.
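The 1/16th figure follows directly from the light admitted scaling with pupil area, i.e. with diameter squared (a quick arithmetic check; the function name is illustrative):

```python
def pupil_light_ratio(d_constricted_mm, d_dilated_mm):
    """Ratio of light admitted at constriction vs. full dilation.
    Light scales with pupil area (pi * d**2 / 4), so the ratio
    reduces to (d1 / d2) ** 2."""
    return (d_constricted_mm / d_dilated_mm) ** 2

# Constricting from 8 mm (fully dilated) to 2 mm admits
# (2/8)**2 = 1/16 of the light.
print(pupil_light_ratio(2.0, 8.0))  # -> 0.0625
```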
The lens is a transparent structure behind the iris that can change shape to change the focal distance of the eyes. Like the cornea, the lens refracts light that passes through it to bring it into focus on the retina. Unlike the cornea, the lens can change shape, changing its refractive properties and thereby changing the focal length of the eyes. This process is known as accommodation. By undergoing accommodation the lens in the eye can bring light from different depths into focus on the photoreceptors in the retina. The lens is responsible for the last third of the focusing power of the eyes.

Figure 2.2: Diagram of the human retina. Image adapted from [8] and [9].
The retina lies at the back of the eye and covers an average area of 981 mm². The retina is in the shape of a spherical cap at the back of the eye and is centered at the visual axis with an eccentricity of 110°. I use perimetric angles to describe the size of the retinal regions for generality. The retina can be divided into roughly 3 regions. At the center of the retina is a circular region called the fovea, which is centered at the visual axis and has the highest density of photoreceptors in the retina, but only has a radius of 2°. Directly encompassing the fovea is the central retina, which has an outer radius of 22°. The rest of the retina is part of the peripheral retina and has an outer radius of 110°. Within the less than 5 mm thickness of the retina, light is captured and the first steps in the processing of the detected visual information occur.
In the retina there are two types of photoreceptive cells: cone cells and rod cells. There is only one type of rod cell, but there are three types of cone cells (short, medium, and long wavelength cones), categorized by their spectral sensitivities, figure 2.3. The different spectral sensitivities of the cone cell types enable the perception of color. Cones are also instrumental in the perception of the finest spatial details in vision because the fovea, which has the highest density of photoreceptors, only contains cone cells, figure 2.4. Outside of the fovea the density of cone cells drops sharply, but maintains a non-zero value even into the peripheral retina. The other type of photoreceptive cell in the retina is the rod cell; while rod cells are not found in the fovea, they have a high density everywhere else. In fact, there are more rod cells in the retina than cone cells. Rod cells are monochromatic and only detect changes in illumination, but they are much more sensitive to light than cone cells. Rod cells are sensitive to illumination levels from 10⁻⁶ cd/m² to 10² cd/m², while cone cells are sensitive to illumination levels from 10⁻³ cd/m² to 10⁸ cd/m².
Figure 2.3: Spectral sensitivities of the (s)hort, (m)edium, and (l)ong wavelength cones.
When light hits the retina, for that light to be detected it must pass through the entire thickness of the retina until it hits the photoreceptors located at the back of the retina, which convert light into an electrical signal. The electrical signal is in the form of a series of action potentials, or spikes. The frequency of those spikes is reduced with increasing light levels, and increases as the amount of light falling on that cell is reduced. The spiking electrical signal is then passed back up through the retina, where early visual processing occurs to extract features like spatial contrast and orientation [36]. The cells in the middle of the retina responsible for this are the
Figure 2.4: Density of rod and cone photoreceptors in the retina as a function of perimetric angle. Adapted from [8].
bipolar, amacrine, and horizontal cells. At the highest layer, near the surface of the retina, lie the ganglion cells, which are the last cells to process the information before it is sent on to the rest of the brain through the optic nerve, figure 2.5. The ganglion cells integrate information over their receptive fields to look for features at different spatial scales. A receptive field, in terms of the visual system, is a region of space from which a neuron receives stimuli. In terms of vision, a receptive field at a given moment in time will correlate with a set of photoreceptors in the retina. The receptive field of a ganglion cell is then the local set of photoreceptors that the ganglion cell receives information from. These fields can be as small as two photoreceptors or as large as a few hundred photoreceptors [37].
2.3 Perceptual Limits
There are six perceptual properties of the human visual system that can be used to constrain the design of a display so that its capabilities are matched to the perceptual limits of human vision. The properties are spatial resolution, temporal resolution, depth perception, spectral resolution, the absolute sensitivity of the visual system, and the relative sensitivity of the visual system. The visual system would be simple to define in this way if each of the six properties were independent, but numerous factors can affect the perceptual limits of each property. This section describes each property, its absolute perceptual limit (for the average person), the factors that constrain that limit, and how those factors affect it.
2.3.1 Spatial Resolution
decrease in spatial resolution does not follow the density of cones or rod cells, but is closer to the density of ganglion cells, as shown in figure 2.6.
Figure 2.6: Psychophysical data for the human eye and optical and retinal limits of vision. Psychophysical data (achromatic and chromatic acuity) obtained using drifting gratings at various spatial frequencies, but for a fixed temporal frequency of 8 Hz. Chromatic acuity obtained using red and green bars. Reproduced with permission [39].

The limit of human spatial resolution is highly dependent on the temporal frequency of the image, the contrast of the image, and the color of the image. The above
subsection holds for black objects on white backgrounds at maximum contrast, but as the contrast level between the foreground and background decreases, the maximum resolvable spatial frequency also decreases, as shown in figures 2.11 and 2.12 [40]. The maximum resolvable spatial frequency of an object also decreases as the eccentricity of where the image falls on the retina increases. This is shown by the achromatic acuity line in figure 2.6, where the eccentricity of an object is the distance in degrees between the fovea and the region of the retina that is viewing that object.
2.3.2 Temporal Resolution
A spinning wheel rotating forward at low RPM will appear to spin forward, but as the rotational speed approaches the temporal sampling limits of the eye, the wheel will appear to slow down and then reverse direction. At the speed where it appears to cease motion, the effective sampling limit of the eye has been reached. Unlike a video camera, though, the human eye does not have a set sampling rate: the maximum sampling or flicker frequency of the eye varies with eccentricity.
The human eye is highly sensitive to changes in contrast at around 10 Hz, but as the temporal frequency of the illumination increases, that sensitivity quickly drops off. The contrast sensitivity can drop by as much as two orders of magnitude when the signal has a temporal frequency of 50 Hz. This simply means that if a sequence of flashes is shown at around 50 Hz, the contrast of the flash with respect to the background must be increased in order to perceive that the stimulus is actually a series of flashes and not just a continuous light source. Yet this cannot continue down to infinitesimally short pauses between flashes. At the very least, eyes have a finite dynamic range whose upper limit is some finite illumination level.
Figure 2.7: Temporal sensitivity for two different eccentricities (0 and 45 degrees). Reproduced with permission [11].
It is also possible that the limitation is solely in time, and any delay between two flashes less than some minimum length threshold will not be detected as separate flashes regardless of how bright the illumination level. Currently this information is unknown. It has been shown that for modest illumination levels of about 1 cd/m² with a black background, the two flash fusion threshold was found to be below 11 ms [41]. The two flash fusion threshold is a test in which a light flashes two times in the same location with a short delay between the end of one flash and the start of another. The two flash fusion threshold is the value of the delay between flashes at which an observer is able to, with 50% accuracy, distinguish between a single flash and two sequential flashes. The precise threshold has yet to be determined, since very few experiments measuring contrast sensitivity as a function of temporal frequency go beyond 80 Hz. What has also been shown is that the temporal resolution of the eye is higher at the periphery of the retina and lowest in the fovea [41]. The exact variance of the temporal resolution over the topography of the retina has not yet been characterized.
2.3.3 Spectral Resolution
The photoreceptors in the human eye are sensitive to a narrow band of wavelengths within the electromagnetic spectrum, namely the visible spectrum, which consists of wavelengths from 400 nm to 700 nm. Within this spectrum a person can distinguish on the order of 7 million different colors. Due to the band-pass shape of the sensitivity of the cones in the eye, the visual system is not equally sensitive to all wavelengths of light in the visible spectrum. Different wavelengths of light are perceived by comparing how the three types of cone photoreceptors (short, medium, and long) respond to light. In order to present a more linear color space with respect to the color sensitivity of the human eye, the CIE (Commission Internationale de l'Eclairage) developed the CIELAB (CIE luminance, a*, b*) and CIELCH (CIE luminance, chroma, hue) spaces. These spaces specify a color by plotting three variables: L* (luminance),
C* (chroma), and H (hue). Differences in these coordinates are weighted by the functions

$$S_L = \begin{cases} 0.511 & L^* < 16 \\[4pt] \dfrac{0.040975\,L^*}{1 + 0.01765\,L^*} & L^* \ge 16 \end{cases}$$

$$S_C = \frac{0.0638\,C^*}{1 + 0.0131\,C^*} + 0.638$$

$$S_H = (F\,T + 1 - F)\,S_C, \qquad F = \sqrt{\frac{C^{*4}}{C^{*4} + 1900}}$$

$$T = \begin{cases} 0.36 + \lvert 0.4\cos(35^\circ + H)\rvert & H \le 164^\circ \text{ or } H \ge 345^\circ \\[4pt] 0.56 + \lvert 0.2\cos(168^\circ + H)\rvert & 164^\circ < H < 345^\circ \end{cases}$$
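These weighting functions match those of the CMC(l:c) colour-difference formula, and can be sketched in Python as follows (function and parameter names are my own, not from this text):

```python
import math

def cmc_weights(L, C, H):
    """CMC(l:c) weighting functions for lightness L*, chroma C*, hue angle H (degrees)."""
    # Lightness weight: constant for very dark colors, compressive otherwise
    if L < 16:
        SL = 0.511
    else:
        SL = 0.040975 * L / (1 + 0.01765 * L)
    # Chroma weight grows with chroma
    SC = 0.0638 * C / (1 + 0.0131 * C) + 0.638
    # Hue weight depends on chroma (through F) and on hue angle (through T)
    F = math.sqrt(C ** 4 / (C ** 4 + 1900))
    if 164 < H < 345:
        T = 0.56 + abs(0.2 * math.cos(math.radians(168 + H)))
    else:
        T = 0.36 + abs(0.4 * math.cos(math.radians(35 + H)))
    SH = (F * T + 1 - F) * SC
    return SL, SC, SH
```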
2.3.4 Dynamic Range
amount of light incident on the retina. Secondly, chemicals in the retina are able to alter the sensitivity of the rods and cones, thereby adjusting the photoreceptors' sensitivity to light. The third mechanism is the different operating regions of the cone and rod photoreceptors. Cone cells can only respond to light in the photopic and mesopic illumination regions. Rod cells are sensitive to light levels in the mesopic and scotopic regions.
Figure 2.9: Dynamic range of the visual system. Adapted from [12].
2.3.5 Contrast Sensitivity
Contrast sensitivity tests show black and white gratings of varying intensity and spatial frequency in order to determine how spatial and temporal frequency affect whether a person can still perceive changes in shading at different contrast levels. From these experiments, contrast sensitivity curves have been created, along with models of the contrast sensitivity of an average person. When the spatial or temporal frequencies are greater than one, in cycles/deg or cycles/sec respectively, the contrast sensitivity is relatively constant as the temporal or spatial frequency is varied, but it drops off linearly after 10 Hz or 10 cycles/deg. For spatial and temporal frequencies less than 1 cycle/deg or 1 cycle/sec, respectively, the contrast sensitivity decreases for low temporal or spatial frequencies, as shown in figures 2.11 and 2.12 [43].
Figure 2.11: Contrast sensitivity vs. spatial frequency for various temporal frequencies: 1 Hz, 6 Hz, 16 Hz, and 22 Hz. From [43].
Figure 2.12: Contrast sensitivity vs. temporal frequency for various spatial frequencies: 0.5, 4, 16, and 22 cycles/deg. Image from [43].
2.3.6 Depth Perception
There are numerous techniques that people use to estimate the depth of the objects they see. There are physiological methods like accommodation, vergence, binocular disparity, and motion parallax. These physiological means are used in combination with image cues for relative depth estimates, like object size, whether an object is occluded or occludes another object, whether fine details or textures are visible, and perspective effects, to name a few. A person may use some or all of these cues in order to estimate the depth of objects they see, but depending on the depth of an object some cues are the dominant source of the estimate, figure 2.13. In this section we focus on the limits of the physiologically based depth cues, as they have a direct effect on the design of a display and give absolute measures.
Accommodation: Eyes undergo accommodation when the lens in the eye changes shape in order to bring an object into focus. When an object is too close to a person's eye, focus cannot be maintained and the object appears blurry. The minimum distance at which a person can keep an object in focus is known as the amplitude of accommodation. For the average adult, the amplitude of accommodation is about 10 cm.
Vergence: Eyes undergo vergence when they rotate so that each eye is fixated upon the same object. When the optical axes of both eyes cross, the eyes are said to be converged. The minimum distance at which the eyes of the average person can converge on the same object before that person sees double, or convergence can no longer be
31
1000
Texture
Brightness
Size
Motion Parallax
Binocular Disparity
Accommodation
100
10
1000
a different position and depth. The amount the lens in each eye needs to change, or the amount the eyes need to rotate in order to fixate on an object, is large for objects near to a person, but decreases the further away that object is. So while for objects very far away accommodation and vergence may still change, the change is so small that it is no longer the most accurate way to estimate the depth of an object, figure 2.13.
Figure 2.14: Depth limit of motion parallax and binocular disparity. The open circles show motion parallax (left axis) and the closed circles show binocular disparity (right axis). Image from [46].
the eyes, the mind can estimate the depth of objects in a scene. Using random dot disparity gratings, the thresholds at which depth could be perceived through motion parallax and binocular disparity have been estimated [46]. For the motion parallax experiment, the subject viewed a computer screen containing dots at random positions. When the subject's head moved, the dots on the screen moved according to the depth they were supposed to have, in order to create a motion parallax effect. For the binocular experiment, two screens of dots were shown to the subject, one for each eye, but the dots in the second image were simply displaced versions of the first image, where the displacement was inversely proportional to the intended depth of the dot. Motion parallax and binocular disparity seem to have similar optimal thresholds, around 0.2-0.4 cycles/deg in depth. Motion parallax had a measured maximum threshold of 6 arc seconds/cm, while binocular disparity had a measured maximum threshold of 20 arc seconds of disparity. The binocular disparity threshold is usually about half that of motion parallax, but has a similar form, figure 2.14.
2.4 Summary
In this chapter we presented an overview of the parts of the eye that have a direct impact on the limits of visual perception. We then broke visual perception into six elements and defined the limits of each element.
Chapter 3
Information Content of Tone Mapping Methods
3.1 Introduction
Of all the perceptual properties of the human visual system (HVS), dynamic range is the one in which both sensors and displays fall significantly below the limits of human perception. Basic imager designs typically have a dynamic range of approximately 3 orders of magnitude [47], which is significantly less than the full and intra-scene dynamic ranges of the HVS, which span 13 and 5 orders of magnitude respectively. There are many hardware designs that improve the dynamic range of images captured by the image sensor [16,31-34,48-50]. Even with these improvements,
state of the art sensors may only have a dynamic range of 7 orders of magnitude [16], which is still well below the full dynamic range of the HVS, but is greater than the intra-scene dynamic range of the HVS. Software techniques can in principle achieve much larger dynamic ranges, but sacrifice the temporal sensitivity of the sensor to do so [51].
The real limitation lies in the low dynamic range capabilities of displays. Most displays have a dynamic range of only 3-4 orders of magnitude. Even the latest High Definition Televisions (HDTVs) only claim to span 4-5 orders of magnitude, though it is unclear if any displays are capable of actually achieving this dynamic range in a single scene. So when a high dynamic range image (HDRI) is shown on a low dynamic range display without any sort of dynamic range compression, only parts of the image may be visible while other regions may appear featureless, figure 3.1.
Figure 3.1: Two uncompressed views of the same high dynamic range image. (a) The darker regions of the image are visible, but the outside of the cave appears overexposed and featureless. (b) The outside of the cave is visible, but the inside of the cave is black.
So long as image sensors have higher dynamic ranges than displays, it will be necessary to find ways to compress the dynamic range of images to fit within the dynamic range of the display, while maximizing the amount of information shown. Dynamic range compression for the sake of maximizing the amount of visual information is a new way of evaluating dynamic range compression. In the next section we present an overview of the basic techniques used to capture or create high dynamic range images. In section 3.3 we present a brief overview of the types of high dynamic range (HDR) compression methods, known as tone mapping methods, and describe the operation of several of these methods in detail. In section 3.4 we evaluate several tone mapping methods using this new metric. We then implement a minor change to one of the methods from the first study and show a significant improvement over standard tone mapping methods.
3.2 High Dynamic Range Image Capture
3.2.1 Hardware methods
The standard CMOS Active Pixel Sensor (APS) consists of a reset component, the photosensitive element, such as a photodiode, an internal or external capacitance on which charge is collected, and an analog to digital converter, as shown in figure 3.2.
The standard operation of the pixel is as follows: the voltage on the photodiode, v_0, is set to some known value, V_reset, and after reset the photodiode removes charge from the capacitor at a rate linearly proportional to the level of illumination, thus decreasing the voltage v_0. After some time t_int, the voltage v_0 on the line is converted to a digital value through an Analog-to-Digital Converter (ADC). The limitation of this method is that for medium and high levels of light the maximum amount of charge has already been removed from the capacitor, so the final voltage for these two cases is the same, figure 3.3. Otherwise the integration time t_int is reduced to
Figure 3.2: Components of a standard CMOS Active Pixel Sensor (APS) pixel.
a smaller value to avoid saturation at medium and high illumination levels, but in the low light condition the amount of charge, Q, lost may then be too small for detection; this may be the result of noise or the low resolution of the ADC.
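The saturation limitation can be illustrated with a toy model of the integrating pixel (all parameter values here are illustrative, not taken from any particular sensor):

```python
def aps_output(illumination, t_int, v_reset=3.3, k=1.0):
    """Voltage on the photodiode line after integrating for t_int seconds.

    Charge is removed at a rate proportional to illumination (with gain k),
    so the voltage falls linearly from v_reset until it saturates at 0 V.
    """
    v = v_reset - k * illumination * t_int
    return max(v, 0.0)

# Medium and bright scenes both saturate within the integration time,
# so their outputs become indistinguishable.
dim, medium, bright = 0.1, 10.0, 1000.0
outputs = [aps_output(i, t_int=1.0) for i in (dim, medium, bright)]
```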
One method to alleviate this limitation is to reduce or vary the integration time by some factor. This increases the dynamic range up to 4-6 orders of magnitude by looking at the same scene with several different exposure and/or integration times [16,31,48]. The drawback of this design is that the effective frame rate is reduced, since the same scene must undergo multiple integrations in order to maximize the dynamic range, figure 3.7.
Another method is to change the relationship between the photocurrent and the illumination level. Instead of being linearly proportional, the photocurrent can depend on the logarithm of the level of illumination on the photodetector, thereby compressing the response of the photodetector and increasing the dynamic range by up to 5 orders of magnitude [49,50], as shown in figure 3.4. The main drawback of this methodology is that any errors or mismatch between transistors in different pixels may alter the precise values of the illumination power to photocurrent transfer function, generating an exponentially larger error than what would exist in a regular imager.
The previous designs waited a set amount of time to read out the voltage on the integration line. An alternative design waits until the voltage or charge reaches a certain value and then generates a digital pulse, spike, or event. There are numerous names, but they generally mean the same thing. The rate of events is then counted, or the time between events is measured [16,31-34]. This method can achieve very high
41
^R eset
reset
CD
(a)
Figure 3.4: Simple diagram to show where changes are made to APS design to imple
ment a compounding sensor, (a) The Reset component is usually altered in order to
produce a logarithmic response in the transfer curve, (b) Charge removed for different
illumination levels for standard imager vs. logarithmic compression.
dynamic ranges of 6-7 orders of magnitude, but may also have significant drawbacks depending on the exact implementation. Any design that uses a counter per pixel, to either count the time between pulses or the number of pulses in a set amount of time, will generally have a very low fill factor and pixel density due to the size of the counter. Designs that count spikes or events off chip, such as event generator circuits, must have very high read out speeds in order to read out from a large number of pixels, especially if they are generating many events, as is the case when viewing bright scenes. Using some type of adaptation in order to reduce the read out speed requirement ends up limiting the temporal properties of the imaged scene; scenes that change quickly cannot be properly adapted to.
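Under the same toy pixel model, the time-to-saturation readout can be sketched as follows: the pixel emits an event when the integrating voltage crosses a threshold, and the measured illumination is inversely proportional to that crossing time (all names and parameter values are illustrative assumptions):

```python
def time_to_event(illumination, v_reset=3.3, v_thresh=0.3, k=1.0):
    """Time for the photodiode voltage to fall from v_reset to v_thresh.

    The pixel emits an event at this crossing time; brighter scenes cross
    sooner, so illumination is inversely proportional to the event time.
    """
    return (v_reset - v_thresh) / (k * illumination)

def estimate_illumination(t_event, v_reset=3.3, v_thresh=0.3, k=1.0):
    """Invert the event time back into an illumination estimate."""
    return (v_reset - v_thresh) / (k * t_event)
```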
Figure 3.5: Simple overview of the operation of an event generation pixel. (a) Generic design of an event generator pixel. (b) Operation of an event generator imager. Measured illumination is inversely proportional to the saturation times t_h and t_m.
Another method, which operates on a principle similar to the human eye, generates an event or spike only when the illumination level has changed by some fractional amount from its previous value [13]. In the end, though, even with adaptation, where the sensor adjusts the integration time, compression level, or threshold voltage depending on the level of illumination, state of the art sensors still have dynamic ranges below 8 orders of magnitude.
3.2.2 Software methods
Figure 3.6: Diagram of the spike-based temporal contrast vision sensor: photoreceptor, amplifying differentiator, and quantizers. Schematic from [13].
The standard software method combines multiple images of a single scene taken at multiple different exposures. Using these images, along with the exposure times the pictures were taken with, the nonlinear response function of the camera is reconstructed. The premise of this method is that by knowing the exposure times and digital values in a series of images, the actual radiant intensity levels in the scene can be estimated. This type of HDRI is also known as a radiance map. In order to generate this response curve it is necessary to solve a linear least squares problem to estimate the non-linear logarithmic response curve of the camera, relating pixel values to illumination levels for a given exposure time. This computation is expensive if done on chip, but for a digital camera, over a large portion of the output range, the digital values are linearly proportional to the illumination level. Using this assumption, the radiance map calculation is significantly simpler, requiring only the calculation of the scale factor between corresponding pixels in each image.
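The simplified linear-response calculation can be sketched as follows (a toy Python version operating on flat lists of pixel values; the clipping thresholds and all names are illustrative assumptions, not part of the method as published):

```python
def radiance_map(images, exposure_times, v_min=10, v_max=245):
    """Estimate a radiance map from multiple exposures of the same scene.

    Assumes a linear camera response: pixel value ~ radiance * exposure time.
    Pixels near the clipping limits (below v_min or above v_max) are ignored;
    each remaining sample contributes radiance = value / exposure time, and
    the valid samples for a pixel are averaged.
    """
    n_pixels = len(images[0])
    radiance = []
    for p in range(n_pixels):
        samples = [img[p] / t for img, t in zip(images, exposure_times)
                   if v_min <= img[p] <= v_max]
        radiance.append(sum(samples) / len(samples) if samples else 0.0)
    return radiance
```

With two exposures of a two-pixel scene, a pixel clipped in the short exposure is recovered from the long one, and unclipped pixels average consistently across exposures.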
3.3 Tone Mapping approaches
The need to compress a HDR image to fit onto a low dynamic range display is not a new problem and has been an active area of research in the computer graphics and image processing fields for several years. The various compression methods that have been developed are referred to generally as tone mapping algorithms [5]. The goal in the development of most tone mapping algorithms has been to modify a HDR image so that when it is shown on a display with a lower dynamic range it appears perceptually similar to the original HDR image, or so that it appears aesthetically pleasing [25,26]. In order for the compressed image to appear perceptually similar, it is important for the compressed image to keep all of the perceptual features that the HVS tries to extract from the original HDR image. Tone mapping algorithms can generally be
sorted into one of two categories: global algorithms and local algorithms. In the rest of this section we describe these two approaches and describe the operation of some example tone mapping methods for each category.
3.3.1 Global Algorithms
Global tone mapping algorithms directly map the value of a pixel in a HDR image to the value of a pixel in the compressed version. This is a straightforward approach and the simplest way to compress an image. These methods generate a one-to-one mapping between pixel values in the HDRI and the tone mapped version, but the mapping is independent of the pixel's location in the image. Such techniques are also called tonal reproduction curves (TRC), but such image independent techniques may not properly display all the details within the scene, especially those in very dark or very bright regions.
Two of the simplest global tone mapping algorithms are linear and logarithmic compression of a HDRI. Logarithmic compression of HDR images to create a low dynamic range (LDR) image is probably the simplest tone mapping method that is generally effective. Some other algorithms that fall into this category are the log adaptation method by Drago [20] and the histogram adjustment by Greg Ward Larson [21].
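A logarithmic TRC of this kind can be sketched in a few lines (a minimal Python illustration; the function name and the 0-255 display range are assumptions for the sake of the example):

```python
import math

def log_tonemap(luminances, display_max=255.0):
    """Global logarithmic compression: map log(1 + L) linearly onto the display range."""
    L_max = max(luminances)
    return [display_max * math.log(1 + L) / math.log(1 + L_max)
            for L in luminances]
```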
3.3.1.1 Adaptive Logarithmic Mapping
A simple method to compress high dynamic range images is to take the logarithm of the values and linearly scale the result to fit within the dynamic range of the display device. And while logarithmically scaling the image is better than simply linearly scaling the values of the image, the compression can be excessive, and the feeling of high contrast within the scene can be lost, giving the image a washed out appearance. To improve the quality of the image, the base of the logarithm can be varied on a per pixel basis, such that for dark pixels a low base logarithm is used, like log_2, while for high pixel values a higher base logarithm is used, like log_10; using the relation between logarithms allows a smooth interpolation from log_2 to log_10. The idea behind the adaptation of the logarithm is that a low base logarithm such as log_2 has a low compression, preserving detail in dark and medium illumination regions of the image, while log_10 has a much stronger compression, allowing a stronger compression in the brighter regions of the image. This can compress the scene to fit within the range of the display while controlling the compression so the impression of high contrast within the scene is maintained.
Equation 3.3.1 shows the fundamental equation of the adaptive logarithmic tone mapping algorithm:

$$L_d(x,y) = \frac{L_{d,\max}/100}{\log_{10}(1 + L_{w,\max})} \cdot \frac{\log(1 + L_w(x,y))}{\log\!\left(2 + 8\left(\dfrac{L_w(x,y)}{L_{w,\max}}\right)^{\log B/\log 0.5}\right)} \qquad (3.3.1)$$

Here L_w(x,y) is the world luminance at pixel (x,y), L_w,max is the maximum world luminance in the scene, L_d,max is the maximum luminance of the display, whose default is 100 cd/m², and B is a free parameter used to adjust the compression and enhancement of features in the dark and bright regions; the default is approximately 0.85.

Figure 3.8: Example image that has been tone mapped using the Adaptive Logarithmic tone mapping method.
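The adaptive logarithmic mapping described in this subsection can be sketched in Python as follows (a sketch, not the authors' implementation; names are my own, and with the default parameters the brightest scene luminance maps to 1.0):

```python
import math

def drago_tonemap(Lw, Lw_max, Ld_max=100.0, B=0.85):
    """Adaptive logarithmic mapping of world luminance Lw to display luminance.

    The effective logarithm base is interpolated, via the bias power function,
    between log_2 in dark regions and log_10 in bright regions.
    """
    bias = (Lw / Lw_max) ** (math.log(B) / math.log(0.5))
    return (Ld_max / 100.0) / math.log10(1 + Lw_max) * (
        math.log(1 + Lw) / math.log(2 + 8 * bias))
```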
3.3.1.2 Histogram Adjustment
Figure 3.9: Example image that has been tone mapped using the Histogram Adjustment tone mapping method.
Equation 3.3.2 maps scene log-luminance to display log-luminance through the cumulative histogram P of the scene's log-luminance values, where f(b_i) is the number of samples in log-luminance bin b_i and T is the total number of samples:

$$\log(L_d(x,y)) = \log(L_{d,\min}) + \left[\log(L_{d,\max}) - \log(L_{d,\min})\right] P\!\left(\log L_w(x,y)\right), \qquad P(b) = \sum_{b_i < b} \frac{f(b_i)}{T} \qquad (3.3.2)$$
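A naive histogram-equalization mapping in this spirit can be sketched as follows (a simplified Python illustration, not Ward Larson's full operator with its contrast ceiling; the bin count and display limits are assumptions):

```python
import math

def histogram_adjust(luminances, Ld_min=1.0, Ld_max=100.0, nbins=100):
    """Tone map by histogram equalization of log luminance.

    Each pixel's display log-luminance is placed according to the cumulative
    distribution P of scene log-luminances, so display range is allocated in
    proportion to how many pixels occupy each brightness band.
    """
    logs = [math.log10(L) for L in luminances]
    lo, hi = min(logs), max(logs)
    width = (hi - lo) / nbins or 1.0
    counts = [0] * (nbins + 1)
    for b in logs:
        counts[min(int((b - lo) / width), nbins)] += 1
    total = len(logs)
    # cum[j] = fraction of samples in bins strictly below bin j
    cum, running = [], 0
    for c in counts:
        cum.append(running / total)
        running += c
    out = []
    for b in logs:
        P = cum[min(int((b - lo) / width), nbins)]
        out.append(10 ** (math.log10(Ld_min) +
                          (math.log10(Ld_max) - math.log10(Ld_min)) * P))
    return out
```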
3.3.2 Local Algorithms
The HVS, for most illumination levels, is sensitive to contrast, not the actual illumination level. Thus models that incorporate the spatial properties of the image as well as the illumination properties are called tonal reproduction operators or local tone mapping operators. Local tone mapping operators use the local illumination and/or gradient information around each pixel in the HDR image to determine the value of that pixel in the compressed image. These algorithms generally try to separate the scene into two parts: the reflectance differences and the illumination differences (due to light sources). The illumination differences, which usually have a low spatial frequency and span a large area, are compressed, while the differences in reflectance of the objects within the scene, which are usually of low contrast and high spatial frequency, are kept the same [22,23]. Other methods attempt to do the same thing, but actually model the human retina and visual system in order to perform these tasks [24].
Local tone mapping operators are usually more computationally intensive than global tone mapping algorithms, but they are often more adaptable to different types
of images and generally produce better results. However, they often have several parameters that need to be adjusted, and these parameters often depend on the properties of the image. Thus local mapping operators are usually used in a post-hoc or iterative fashion, where a person must play with parameter values to get the image they want. There are numerous types of local tone mapping algorithms; the algorithms used in the psychophysical study in section 3.4, and relevant to this chapter, are the Retinex algorithm by Jobson [52], the photographic tone mapping operator [53], and the bilateral filter algorithm [23]. A more complete review of local and global tone mapping operators can be found in [5].
3.3.2.1 Retinex
The Retinex algorithm is part of a class of algorithms that assume a scene consists of two parts: an illumination layer (the distribution of light from the light source(s)) and a reflectance layer (the distribution of light due to reflections off of objects from the light source(s)). The illumination layer has most of the brightness information, while the reflectance layer contains most of the detail of the scene. In order to extract the reflectance layer, which is assumed to be of a high spatial frequency, the image is divided by a low pass (blurred) version of itself. This is done at three different spatial scales, and the results of these ratios are then averaged together to produce a low dynamic range scene with, hopefully, all of the features visible.
Figure 3.10: Example image that has been tone mapped using the Retinex tone mapping operator.
Equation 3.3.3 shows the fundamental equation of the Retinex algorithm. L_d(x,y) is the display value of the image, I(x,y) is the high dynamic range luminance at pixel (x,y), F(x,y,s_i) is a spatial filter of size s_i, and * indicates convolution. The value of L_d is described in terms of pixel values.

$$L_d(x,y) = \frac{1}{k}\sum_{i=1}^{k}\left[\log(I(x,y)) - \log(F(x,y,s_i) * I(x,y))\right] \qquad (3.3.3)$$
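The multiscale ratio-of-blur idea can be sketched as follows (a toy Python version on nested lists; a box blur stands in for the Gaussian surround filter F, and all names are my own):

```python
import math

def box_blur(img, radius):
    """Crude low-pass filter (box average), standing in for the surround filter F."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - radius), min(h, y + radius + 1))
                    for xx in range(max(0, x - radius), min(w, x + radius + 1))]
            out[y][x] = sum(vals) / len(vals)
    return out

def multiscale_retinex(img, scales=(1, 2, 4), eps=1e-6):
    """Average log(image) - log(blurred image) over several surround sizes."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for s in scales:
        blurred = box_blur(img, s)
        for y in range(h):
            for x in range(w):
                out[y][x] += (math.log(img[y][x] + eps)
                              - math.log(blurred[y][x] + eps)) / len(scales)
    return out
```

On a uniform image the blurred version equals the original, so the output is zero everywhere: only local spatial contrast survives the operator.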
3.3.2.2 Photographic Tone Reproduction

Figure 3.11: Example image that has been tone mapped using the Photographic tone reproduction operator.
3.3.2.3 Bilateral Filter
Bilateral filtering is another algorithm that attempts to decompose the viewed scene into an illumination and a reflectance layer. To compute the base (illumination) layer, an edge preserving low pass spatial filter is used. The type of edge preserving
Figure 3.12: Example image that has been tone mapped using the Bilateral filtering tone mapping operator.
filter used is the bilateral filter which computes for each pixel, the local weighted
average of that pixel with its neighbors. The contribution of each neighboring pixel
is not only weighted by its distance from the central pixel, but is also weighted by the
difference of that pixels luminance value from the central pixels luminance value, such
that pixels that have large differences in intensity are weighted less. The reflectance
56
layer is then estimated as the ratio between the original image and the base layer.
The base layer is then compressed while the reflectance layer is kept the same. The
final enhanced scene is created by multiplying the compressed base layer with the
reflectance layer.
Equation 3.3.5 shows the fundamental equation of the bilateral filtering algorithm,
where J_s is the base illumination of pixel s = (x, y), f and g are edge-stopping
functions such as Gaussian or Lorentzian filter functions, and w is a scale factor used to
compress the base illumination layer. The value of L_d is described in terms of pixel
values.

\[ J_s = \frac{1}{k(s)} \sum_{p \in \Omega} f(p - s)\, g(I_p - I_s)\, I_p, \qquad k(s) = \sum_{p \in \Omega} f(p - s)\, g(I_p - I_s) \tag{3.3.5} \]
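The full pipeline (bilateral-filter the log luminance to get the base layer, keep the detail layer, compress only the base) can be sketched with a brute-force bilateral filter; the parameter values are illustrative:

```python
import numpy as np

def bilateral_filter(img, sigma_s=3.0, sigma_r=0.4, radius=6):
    """Brute-force bilateral filter: each output pixel is a weighted
    average of its neighbors, weighted both by spatial distance (f) and
    by intensity difference (g), as in equation 3.3.5."""
    h, w = img.shape
    out = np.zeros_like(img)
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    f = np.exp(-(xs**2 + ys**2) / (2 * sigma_s**2))        # spatial kernel
    pad = np.pad(img, radius, mode='edge')
    for i in range(h):
        for j in range(w):
            patch = pad[i:i + 2 * radius + 1, j:j + 2 * radius + 1]
            g = np.exp(-(patch - img[i, j])**2 / (2 * sigma_r**2))  # range kernel
            wgt = f * g
            out[i, j] = (wgt * patch).sum() / wgt.sum()
    return out

def bilateral_tonemap(lum, target_range=2.0):
    """Compress the base (illumination) layer, keep the detail layer."""
    log_l = np.log10(lum + 1e-8)
    base = bilateral_filter(log_l)
    detail = log_l - base
    w = target_range / (base.max() - base.min() + 1e-8)  # compression factor
    out_log = w * base + detail
    return 10.0 ** (out_log - out_log.max())   # brightest pixel -> 1.0
```

Working in log10 luminance means the base-layer scale factor w directly sets how many decades of dynamic range survive, while the detail layer passes through untouched.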
3.4
Psychophysical Study
The research that has been conducted in the evaluation of different tone mapping
algorithms has mainly been concerned with how perceptually similar a compressed
image is to the original HDR image and/or how aesthetically pleasing the compressed
image is [25,26]. However, a perceptually similar image does not necessarily maximize
the amount of visual information shown to the user. Because of this potential
difference we wanted to develop or find a tone mapping algorithm that actually maximized
the amount of visual information displayed to the viewer. To get an idea of what
types of tone mapping algorithms work best for this goal, we ran a psychophysical
study comparing well-known tone mapping algorithms. These algorithms were chosen
based on their popularity, their potential to be implemented in a mobile vision
system, and to ensure a representative sampling of the different types of tone mapping
algorithms. Based on the results of the first study, described below, we decided to
modify the bilateral filter tone mapping algorithm, one of the algorithms compared
in the study, to try to improve the amount of visual information it showed. In
the second psychophysical study our modified algorithm was tested against the top
four performing algorithms from the first study to determine if it actually did
improve the amount of visual information shown. The following subsections describe
the procedures, results, and a discussion of those results from the first and second
psychophysical studies.
3.4.1
Perception Lab
In order to conduct the psychophysical tests using the various image processing
algorithms we developed, we needed an experimentation room in which to run these
experiments. To do this we designed and built a visual perception lab, figure 3.13, which
houses multiple HDTVs on which different types of visual information can be presented.
The room has been designed so that the only visual stimuli during an experiment
come from the TVs, and reflections are reduced by covering the walls and ceiling
in matte black foam material and carpeting the floor. The image of the Visual
Perception lab in figure 3.13 shows that the visual stimuli can be displayed on one
or more of six HDTVs. The central screens cover the central 40° of the subject's
horizontal field of view and the central 60° of the vertical field of view, while the two
peripheral screens are mounted on a swiveling axis to increase the sense of immersion
of the subject and increase the horizontal FOV to 97°.
We conducted both studies in the Perception Laboratory that we designed, but
for these studies we only needed the central monitor of the six available monitors.
All of the monitors that weren't involved in the study were kept off during testing.
Participants sat in an adjustable chair, approximately 70-75 inches from the screen.
For both studies five tone mapping algorithms were compared and contrasted. We
used ten HDR images of different scenes for each experiment and ran them through the
five algorithms to generate a set of fifty test images, which we used in the different
tasks.
3.4.2
Study I
Tone mapping algorithms have historically been designed to create images
perceptually similar to their HDR source and/or that are aesthetically pleasing, which does
not necessarily maximize the amount of visual information shown; but as a starting
point we decided to see how well they actually did achieve this new metric anyway.
The tone mapping algorithms that were included in the study were the Log Adaptation
algorithm by Drago [20], the Histogram Adjustment algorithm by Ward [21], the
Retinex algorithm by Jobson [52], the Photographic tone mapping operator by
Reinhard [53], and the Bilateral Filter algorithm by Durand and Dorsey [23].
3.4.2.1
Methods
For the first study, thirty students and individuals from communities around The
Johns Hopkins University were used as test participants. Participants were between
the ages of 18 and 35 and had a minimum visual acuity of 20/30; their vision was tested
using an OPTEC 5000 vision tester. They were paid $20.00 per hour for their
participation and were also asked to fill out a standard demographic questionnaire.
For the first study the following tone mapping algorithms were used to generate the
test images: the Log Adaptation algorithm by Drago [20], the Histogram Adjustment
algorithm by Ward [21], the Retinex algorithm by Jobson [52], the Photographic tone
mapping operator by Reinhard [53], and the Bilateral Filter algorithm by Durand
and Dorsey [23], shown in figure 3.14.
[Figure panels (a)-(e); recovered labels: (a) Retinex, (e) Photographic Tone Mapping]
Figure 3.14: Example images generated using the different tone mapping algorithms.
3.4.2.2
Procedure
For the first study, after the vision test and questionnaire each subject was asked
to complete two tasks: an object detection task, as an objective measure, and an image
comparison task, as a subjective measure. Before each task the subjects had a trial
run of the task in order to become familiar with the test setup before the actual
experiment. The images used for each trial were from a different set than the ones used
in the subsequent experiments.
3.4.2.2.1
Images
The source HDR images we used were provided online by a variety of sources:
the Timo Kunkel Graphics Group; Laurence Meylan, from the Munsell Color Science
Laboratory Database; Rafal Mantiuk, from the Max Planck Institut Informatik; and
Erik Reinhard, Greg Ward, Sumanta Pattanaik, and Paul Debevec from [5]. The set
of images we used was selected based on the change in illumination level present in
the images, the number of different objects or features that could be identified
in the images, and the number of objects in different illumination level regions of
the images. For the set of training images, figure 3.15, these criteria were somewhat
relaxed compared to the images used in the first study, figure 3.16.
[Figure panels; recovered labels: (b) colorcube, (c) vertical]
Figure 3.15: Training set of images used in the first study. All images shown are
compressed using logarithmic compression.
3.4.2.2.2
Objective Measure
For the object detection task, an example of which is shown in figure 3.17a,
one image was presented at eye level on the center monitor of the perception lab. By
the end of this session each participant had viewed 10 of the total 50 images,
one image generated from each of the 10 HDR scenes; he/she did not see the same scene more
than once. The choice of images was controlled so that every tone mapping
algorithm had generated 2 of the 10 images shown to the subject. Each image was
[Figure panels; recovered labels: (a) Set7, (d) garage 2, (e) seriel, (f) voile, (j) moto]
Figure 3.16: Images used in the first study. All images shown are compressed using
logarithmic compression.
displayed for a period of 60 seconds and the participant's task during that time was
to identify as many objects as possible in as much detail as they could. The list
of objects the participant could identify was open-ended, meaning any item in the
scene was potentially an object. The participants were also asked to identify regions
that appeared to have no objects, i.e. featureless sections.
The expectation was that algorithms that displayed more detail than others would allow participants to
identify more objects or would allow them to identify objects in greater detail than
others. Conversely, lower performing algorithms would have more instances of blank
or featureless regions.
Figure 3.17: Example images from the Visual Perception Study. (a) An example image
from the target detection task. (b) An example image of the paired comparison
task.
Each test session was also video recorded for later analysis. The video recorder was placed
behind the observer, so that the screen image and the mouse cursor were visible on the
recording, but the participant's face was not recorded. The number of objects correctly
identified was determined from viewing the video tapes after testing. Participants did
not receive feedback on their performance during testing. A correctly identified object
was defined as any object that was selected using the mouse and verbally identified
by a statement that accurately described the selected object.
3.4.2.2.3
Subjective Measure
The second task, the subjective measure, was a paired comparison task, as shown
in figure 3.17b. For this task two images of the same scene, but generated using
different tone mapping algorithms, were presented side by side at eye level. Both
images were presented on the center monitor. Images were displayed as a split image
on a single monitor so that video settings were consistent across the two images being
compared. Each participant had to make 100 comparisons to complete this task.
Every algorithm was paired with every other algorithm for a given scene, totaling 10
comparisons per scene. The 10 scenes and 10 comparisons per scene resulted in 100
comparisons. For the pairs of images shown, the decision to show one image on the
right or the left of the TV was randomized. Participants were asked to compare pairs
of images and select the image they believed had the greatest amount of visible detail.
For this task we created a graphical user interface in MATLAB to present the images
and allow the participants to make a choice between the two images. This served as
the only record of the participants' choices, as there was no video recording. Upon
selection, the choice made and the time taken to make that selection were recorded for
later analysis. This task was a forced choice: the participant had to select either the
left image or the right image, and he/she was allowed to take as much time as needed
to make that selection. Participants were instructed to try to limit their selection
time to one minute or less, but this was only a suggestion. Selection times exceeding
one minute were possible and did occur occasionally.
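The trial count follows directly from the design: C(5,2) = 10 algorithm pairs per scene times 10 scenes gives 100 forced choices, with left/right placement randomized. A sketch of how such a trial list might be generated (the algorithm and scene names are placeholders):

```python
import itertools
import random

ALGORITHMS = ["retinex", "histogram", "photographic", "bilateral", "log"]
SCENES = [f"scene{i:02d}" for i in range(10)]

def build_trials(seed=0):
    rng = random.Random(seed)
    trials = []
    for scene in SCENES:
        for a, b in itertools.combinations(ALGORITHMS, 2):  # 10 pairs per scene
            # randomize which image appears on the left vs. the right
            left, right = (a, b) if rng.random() < 0.5 else (b, a)
            trials.append({"scene": scene, "left": left, "right": right})
    rng.shuffle(trials)   # randomize presentation order
    return trials

trials = build_trials()
print(len(trials))  # -> 100 comparisons per participant
```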
3.4.2.3
Results
The dependent measure for the object detection task was the percentage of
correctly identified targets. For each image, the total number of possible targets was
determined by totaling the number of distinct objects correctly identified across all
participants, regardless of the algorithm. This number was used as the maximum
number of possible targets for a given image. The percentage of correctly identified
targets was calculated using the number of correctly identified objects for an
individual participant and the maximum number of targets for the corresponding image. A
mixed model analysis of variance indicated there were not any statistically significant
differences between algorithms for the object detection task. This may have happened
because participants would identify the most obvious objects first. These objects also
happened to be the objects that showed up across algorithms. Participants rarely
identified objects that appeared when run through one algorithm, but not another.
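The scoring rule described above can be sketched with sets; the data layout and object names here are hypothetical:

```python
# identifications[image][participant] = set of objects that participant
# correctly identified in that image (hypothetical data layout)
identifications = {
    "garage": {
        "p1": {"car", "ladder"},
        "p2": {"car", "hose", "sign"},
    },
}

def percent_correct(identifications, image, participant):
    # The target pool for an image is the union of everything any
    # participant identified, regardless of which algorithm produced it.
    pool = set().union(*identifications[image].values())
    found = identifications[image][participant]
    return 100.0 * len(found) / len(pool)

print(percent_correct(identifications, "garage", "p1"))  # -> 50.0 (2 of 4 pooled targets)
```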
There were significant differences in the paired comparison task, however; the
results of this analysis are tabulated in Table 3.1. For the paired comparison task the
results were collapsed across images to determine whether one algorithm was preferred
over another regardless of the scene that was being viewed. This resulted in 300
comparisons between each pair of algorithms and showed that the bilateral filter, Retinex,
and photographic tone mapping algorithms were the most preferred. There was not
a statistically significant difference between the Retinex, bilateral filter, and
photographic tone mapping algorithms, as summarized in figure 3.18. The log adaptation
algorithm was preferred less than the bilateral filter, Retinex, and photographic tone
mapping algorithms. The histogram adjustment algorithm was the least preferred in
terms of the amount of detail that appeared in the final image.
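The p-values in Table 3.1 behave like a two-sided test of a 50/50 split over the 300 comparisons per pair (the exact test used is not stated, so this is an inference); a z-test sketch closely reproduces the reported values:

```python
import math

def paired_comparison_pvalue(wins, n=300):
    """Two-sided z-test of the null hypothesis that each algorithm in a
    pair is chosen with probability 0.5 over n forced choices."""
    z = abs(wins - n / 2) / math.sqrt(n / 4)
    return math.erfc(z / math.sqrt(2))   # two-sided tail probability

print(paired_comparison_pvalue(204))  # ~4.5e-10 (cf. Table 3.1, 1 vs. 2)
print(paired_comparison_pvalue(152))  # ~0.817   (cf. Table 3.1, 1 vs. 3)
```

A perfectly even 150/150 split (3 vs. 4 in Table 3.1) gives a p-value of exactly 1, matching the table.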
Comparison   (1) Retinex   (2) Histogram   (3) Photographic   (4) Bilateral   (5) Log        P-Value
                           Adjustment      Tone Mapping       Filter          Adaptation
1 vs. 2      204           96              0                  0               0              4.51E-10
1 vs. 3      152           0               148                0               0              0.817361
1 vs. 4      158           0               0                  142             0              0.355611
1 vs. 5      193           0               0                  0               107            6.86E-07
2 vs. 3      0             102             198                0               0              2.98E-08
2 vs. 4      0             91              0                  209             0              9.58E-12
2 vs. 5      0             122             0                  0               178            0.001224
3 vs. 4      0             0               150                150             0              1
3 vs. 5      0             0               204                0               96             4.51E-10
4 vs. 5      0             0               0                  198             102            2.98E-08

Table 3.1: Paired Comparison Results and p-values collapsed across participants and
scenes.
[Diagram: Histogram Adjustment, Log Adaptation, Retinex, Photographic Tone Mapping, Bilateral Filter, left to right in order of increasing preference]
Figure 3.18: Diagram summarizing the results of the subjective measure experiment
of the first psychophysical study. Algorithms are listed from left to right in order of
increasing preference, in terms of how much visual information or detail seemed to be
visible to test subjects.
3.4.2.4
From the results of the first study we knew that our new tone mapping algorithm
would be a local one; however, the results did not shed any light on what type of
spatial filter our new algorithm should be based on. Almost all local tone mapping
operators use some type of spatial filter to estimate the local features in an image.
Since there appeared to be no perceptual differences, we looked at the computational
differences between each algorithm. The Retinex tone mapping operator runs the
HDR image through three spatial filters, each at a different scale. Unfortunately, the
optimal spatial filter size varies based on image size and the spatial properties of the
image, but there is no known a priori method to determine what the spatial filter size
should be. So an image may need to be filtered at many different scales in order to
find the proper three spatial filters. The Photographic tone mapping operator, unlike
the Retinex operator, adapts to the spatial properties of each image by iterating
through a set of center-surround spatial filters of various sizes until an appropriate
one is found. This produces consistent results without much human intervention, but
multiple spatial filters may be tried before the correct one is found. The bilateral
filtering algorithm, on the other hand, uses only a single filter for the entire image.
The filter is an edge-preserving spatial filter and so effectively adjusts its size based
on the spatial contrast properties of the image, without iterating. The drawback of
using an edge-preserving filter is that it is significantly more computationally intensive
compared to the spatial filters used in the other algorithms. However, Durand and
Dorsey came up with a fast way to compute an approximate edge-preserving spatial
filter that is no more computationally complex than any regular spatial filter [23]. Their
method also allows the original HDR image to be sub-sampled by up to a factor of 20,
which significantly reduces the computational cost of the algorithm. For these reasons
we elected to improve and automate the bilateral filtering algorithm.
A drawback of the bilateral filtering algorithm is that, depending on the spatial
and illumination properties of the HDR image, the final image may come out too
dark or too bright, leaving large regions with a reduced amount of detail or without
any detail at all. Whether the tone mapped image is too bright or too dark is largely
dependent on the scaling factor, w, that is used when combining the detail layer with
the illumination layer from equation 3.3.5. A simple solution, then, is to choose a
default value for w, and then linearly scale the pixel values of the resulting
image so that the brightest pixel is set to the highest displayable illumination value and the
darkest pixel to the darkest displayable value. Unfortunately this also often results in
sub-par images that appear washed out due to a few extremely bright
or dark pixels. This is often the case when the illumination source, such as a light
bulb or the sun itself, appears in the image. The idea behind the automation comes
from the fact that some of the above tone mapping algorithms, such as the histogram
adjustment and photographic tone mapping algorithms, make pictures look better by
sacrificing some of the visibility of the pixels in the images. Instead of trying to ensure
that every pixel is visible, the algorithm tries to ensure that most of the pixels of the detail
layer are visible. This is done by controlling the value of the scaling factor, w, such
that the average pixel value of the resulting image L is in the middle of the range of
displayable illumination levels (by default we assume that value to be 0.5, or 128 for
an 8-bit display).
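That automation amounts to solving for w such that the mean displayed value lands at mid-range. A bisection sketch, assuming the base and detail layers (in log10 units) come from the bilateral decomposition of equation 3.3.5, with the brightest pixel normalized to 1.0:

```python
import numpy as np

def auto_scale(base, detail, target_mean=0.5, iters=60):
    """Bisect on the base-layer scale factor w so that the mean of the
    tone-mapped image sits at the middle of the display range."""
    def render(w):
        out_log = w * base + detail
        return 10.0 ** (out_log - out_log.max())   # brightest pixel -> 1.0
    lo, hi = 0.0, 1.0   # w = 1 leaves the base layer uncompressed
    for _ in range(iters):
        w = 0.5 * (lo + hi)
        if render(w).mean() > target_mean:
            # mean too bright: a larger w restores more base-layer range,
            # which darkens the max-normalized image
            lo = w
        else:
            hi = w
    return 0.5 * (lo + hi)
```

Because the normalized mean decreases monotonically as w grows, bisection converges without any image-specific tuning.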
Figure 3.19: Comparison of the difference in detail between the original and
automated bilateral filtering algorithm. (a) Image generated using the bilateral filtering
tone mapping algorithm. (b) Image generated using the automated version of the
bilateral filter tone mapping algorithm to increase the amount of visual information
seen by the viewer.
3.4.3
Study II
3.4.3.1
Procedure
[Figure panels; recovered labels: (a) belg, (c) variphrase]
Figure 3.20: Images added to the second study. All images shown are compressed
using logarithmic compression.
The procedure for the second study was very similar to that of the first study, with the
following differences. For the object detection task the number of participants was
increased to 60 to try to increase the statistical power per algorithm, but we
kept the number of participants for the paired comparison task at 30. We also changed
the payment of participants to a flat rate of $25 for participants who did both the
object detection and paired comparison tasks, and $15 for participants who did only
the object detection task. We also changed the procedure of the object detection
task by creating a list of ten objects for each scene that participants had to identify.
Each scene had its own list of ten objects and the participants had 30 seconds to try
to identify all the targets on the list. The lists were created by choosing targets
that appeared to show up well for some algorithms, but not for others.
For the second study we kept the four algorithms from the previous study that
had the best performances: the Retinex, the photographic tone mapping, the bilateral
filter, and the logarithmic adaptation algorithms. We also used the automated bilateral
filtering algorithm that we developed as the fifth algorithm. The apparent
improvement, in terms of the amount of visible detail, between the standard bilateral filter and
the modified bilateral filter is shown in Figure 3.19. For the second study we again
had ten source HDR images, six of which were used in the first study (figure 3.16e-j).
The four images that were added to the study are shown in figure 3.20.
3.4.4
Results
Unfortunately, the results from the object detection task again failed to yield
statistically significant differences, so we have yet to show that there are any objective
differences between these five algorithms. It is unclear whether the objects used for
identification simply weren't sufficient to reveal the differences between the
algorithms or whether we still had an insufficient number of people to yield any significant
results. The results of the paired comparison, however, bore significant results. A
summary of the results is shown below, but the most important fact is that the
participants preferred the new algorithm over any of the other algorithms that were
tested, which indicates that the automated bilateral filtering algorithm appears to
show more detail than any of the other tone mapping algorithms. Also, the
global tone mapping algorithm, the logarithmic adaptation algorithm, was again the least
preferred in terms of detail.
Comparison   (1) Retinex   (2) Log        (3) Photographic   (4) Bilateral   (5) Automated      P-Value
                           Adaptation     Tone Mapping       Filter          Bilateral Filter
1 vs. 2      166           134            0                  0               0                  0.064672
1 vs. 3      134           0              166                0               0                  0.064672
1 vs. 4      114           0              0                  186             0                  3.23E-05
1 vs. 5      80            0              0                  0               220                6.66E-16
2 vs. 3      0             108            192                0               0                  1.24E-06
2 vs. 4      0             138            0                  162             0                  0.165857
2 vs. 5      0             81             0                  0               219                1.55E-15
3 vs. 4      0             0              148                152             0                  0.817361
3 vs. 5      0             0              110                0               190                3.86E-06
4 vs. 5      0             0              0                  83              217                1.02E-14

Table 3.2: Paired Comparison Results and p-values collapsed across participants and
scenes from the second study.
[Diagram: Log Adaptation, Retinex, Photographic Tone Mapping, Bilateral Filter, Automated Bilateral Filter, left to right in order of increasing preference]
Figure 3.21: Diagram summarizing the results of the subjective measure from the
second psychophysical study. Algorithms are listed from left to right in order of
increasing preference, in terms of how much visual information or detail seemed to be
visible to test subjects.
3.5
Conclusion
The HVS is a highly sensitive, adaptable, and sophisticated sensory system, and as
the capabilities of imagers and displays improve we must be aware of the possibility
of overdesigning the sensor or display system. The ideal mobile vision system is
one that captures and presents as much information to the user as they can process,
without exceeding those limits. This means that the sensor and display system should
have similar spatial resolutions, temporal resolutions, dynamic ranges, etc., as the
average user's eyes. We must therefore understand what these capabilities are in order
to optimally design imager systems intended for human use. Also, by understanding
not only the capabilities but the underlying physiology, a simpler system can be
designed with negligible effective differences. The basic example is that the entire
visible spectrum can be reproduced using only three wavelengths of light (red, green,
and blue), at least as far as the HVS is concerned. Understanding the underlying
physiology of the eye is also what enables tone mapping algorithms to compress the
absolute intensity information of images with little noticeable difference in the final
image. By preserving all of the features that are picked up by the low-level visual
processing cells in the retina, most of the intensity information can be discarded. The
result of our study to develop a tone mapping algorithm that presented as much
visual information to the viewer as possible for a given dynamic range, 8 bits in our
case, showed that the automated version of the bilateral filtering algorithm appears
to present more information to the viewer than current tone mapping algorithms.
However, this does not prove that it outputs the maximum amount of information.
3.6
Discussion
In order to say whether the automated algorithm is maximal we need a
quantitative measure of how much information is shown to the user. To do this we have
developed an information content quality assessment model, in chapter 4, to estimate
how much information is in a tone mapped image, in order to be able to maximize
the amount of information that it presents to the user. With this model we can show
how close to maximal our automated bilateral filter is and possibly improve it
further.
Chapter 4
Information Content Quality Assessment
4.1
Introduction
There have been several published methods on how to perceptually alter an image
to meet one of many goals: to make the image more perceptually similar to
reality, to make the image more aesthetically pleasing, or, for our purposes, to increase
the amount of perceivable detail in the image [25,26]. But a difficulty lies in trying
to validate the performance of any of these methods. Namely, the most accurate
way to judge how well a method works is to conduct a psychophysical study, where a
random group of people view images generated using one or more of these methods
and are tested or questioned on the perceptual properties of the generated images.
But psychophysical studies can take a long time to complete, they can be expensive,
they may require a large number of people for the results to have any meaning, and
the studies are often limited in the amount of quantitative data produced.
It would be better if a computer model could evaluate the perceptual qualities of
an image in a manner consistent with human perception and psychophysical data.
There may be a slight loss in accuracy in terms of how well the model actually
matches human perception, but it would significantly reduce the time required for
evaluation, and more quantitative data would result from the evaluation. For
instance, in the psychophysical study described in the previous chapter the only data
that resulted from both studies was a rank ordering of the amount of detail perceivable
using the tested HDR compression algorithms. A computer model of perception, an
image quality metric, could produce a numeric comparison of the perceptual detail
between two images, which would not only produce an ordering of which methods on
average resulted in the greatest amount of perceivable detail, but also give a numeric
comparison of how much more perceivable detail a method would produce, on average.
An image quality metric could also produce perceptual difference maps to display
where in an image the largest perceptual differences lie between two images. Also,
the time consuming nature of these studies generally limits their use until an algorithm
has been fully developed, and is only generally good for verifying that the algorithm
4.2
4.2.1
Model Types
Most QA models deal only with monochromatic images, as judging the perceptual
similarity of colors and the appearance of colors in complex images is largely its own
field of research, with some models, such as iCAM [56,57,59], that combine the two;
but as the intended application was NVGs, which are monochromatic, we limit our
discussion to luminance based QA metrics.
The ground-truth assumption a QA metric is based on is whether the metric assumes
that some, all, or none of the data of a perfect or ideal image is available for use in
the comparison. QA models that assume that an ideal/perfect image is available for
comparison with a second, presumably distorted or degraded, image are categorized as
full reference models, which is the most common situation [21,54,55,61-68,70].
If the ideal image is unavailable or unused, as when comparing a digital image with
the real world version of it, the metric is a no reference model. If only portions,
a subset, or just statistical information about the ideal image is available, then the
metric is a partial reference or reduced reference metric [58].
Quality Assessment metrics can generally be grouped into one of two categories:
bottom-up models and top-down models. Bottom-up models, like the visual differences
predictor (VDP) [55] and the Sarnoff Just-Noticeable-Differences (JND) model [71],
try to analyze an image by modeling how the image is acquired by the human
visual system (HVS) and how the image information is processed through the visual
pathway. These generally use models that account for the viewing environment, the
scattering of light in the eye, and the visual processing that goes on within the retina.
Top-down metrics, such as the Structural Similarity Index Measurement (SSIM) [68]
and the Visual Information Fidelity (VIF) [65,66] metric, try to judge an image
using high level ideas such as the statistical structure in an image or the information
content in an image. These metrics attempt to model the high level analysis that
may go on in the brain when viewing an image. Historically, though the bottom-up
approach is closer to biology, top-down metrics such as the SSIM and VIF metrics
often make judgments that are much closer to the judgments that a person
would make. And while these top-down approaches are insensitive to noise, shifts,
and some rotations, they do not handle high dynamic range images well.
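As a concrete illustration of the top-down approach, the single-window (global) form of SSIM scores two images through their means, variances, and covariance; the windowed version applies the same formula over local patches:

```python
import numpy as np

def global_ssim(x, y, L=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM: luminance, contrast, and structure terms
    combined; L is the dynamic range of the pixel values."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2     # stabilizing constants
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return (((2 * mx * my + c1) * (2 * cov + c2))
            / ((mx**2 + my**2 + c1) * (vx + vy + c2)))
```

Identical images score 1, and a contrast-inverted image drives the structure term (the covariance) negative, which is the sense in which SSIM judges "structure" rather than pixel-wise error.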
4.2.1.1
Bottom-Up Models
4.2.1.1.1
The visual differences predictor (VDP) metric by Scott Daly [55] is a well-known
full reference bottom-up model that is capable of detecting many kinds of perceptual
distortions. The metric takes two digital images as input, one source image
and one reference image, and outputs a third image indicating where any differences
between the two images are perceivable. The metric creates the visible
differences map as follows. For the two images the VDP models how the images appear
to a person by taking into account the display parameters: pixel size, pitch, display
resolution, display intensity range, and viewing distance. It also models how the
displayed images are processed by different stages of the HVS. The HVS model takes
into account changes in sensitivity due to illumination level (amplitude compression),
the contrast sensitivity function (CSF) of the HVS, the frequency selectivity of the
HVS, and decreases in sensitivity due to masking effects. The final stage involves
using psychometric functions to determine whether differences are noticeable between
the two images. The likelihood that each distortion is perceivable is then plotted in
the visible differences map.
The benefit of the model is that it takes into account many visual processes that
affect the perception of differences between the two images. However, the distortion
map only identifies locations where differences between the two images can be
perceived. So it gives no numeric assignment of how strong the distortions are once they
are above threshold, nor can it be reduced to a single value descriptor to indicate
overall how much distortion is present in the image.
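The psychometric stage can be illustrated with the standard functional form used in this family of models, P = 1 − exp(−(ΔC/α)^β), together with probability summation across frequency/orientation channels; Daly's slope is commonly cited as near β = 3.5, while the α value here is purely illustrative:

```python
import math

def detection_probability(delta_c, alpha=1.0, beta=3.5):
    """Psychometric function: probability that a contrast difference
    delta_c (expressed in threshold units) is visible."""
    return 1.0 - math.exp(-((abs(delta_c) / alpha) ** beta))

def combine_channels(probs):
    """Probability summation: a difference is seen if it is detected in
    at least one channel."""
    prod = 1.0
    for p in probs:
        prod *= 1.0 - p
    return 1.0 - prod
```

At exactly one threshold unit (delta_c = alpha) the detection probability is 1 − e⁻¹ ≈ 0.63, and the steep slope makes the function behave almost like a hard threshold, which is why the VDP output reads as a per-pixel visibility map.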
4.2.1.1.2
The luminance, u*, and v* images (or image sequences) each then undergo Gaussian pyramid
decomposition, generating several subbands. Each subband in the luminance channel
is then normalized to model the insensitivity of the HVS to overall light level and to
model the loss of sensitivity due to the time dependent adaptation to changing light levels
for video. After normalization, oriented contrast maps and a flicker contrast map are
created for each subband. For the u* and v* channels, chromatic contrast maps are
also made. Then each map is fed into a contrast energy mask to convert the maps to
JND units. The JND maps for each feature channel and for each image (sequence)
are fed into a difference metric to create a luminance JND map and a chroma JND
map.
The benefits of the Sarnoff model are that it can detect perceivable differences
in monochromatic or color images and videos, and since the full JND map is in JND
units the Sarnoff JND model indicates more than just which locations have visibly
noticeable differences; the larger the JND value, the greater the subjective visual
difference that can be seen [71].
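The pyramid decomposition step can be sketched as repeated blur-and-subsample; a separable 5-tap binomial kernel stands in here for whatever kernel the Sarnoff model actually uses:

```python
import numpy as np

def gaussian_pyramid(img, levels=4):
    """Repeatedly low-pass (separable 5-tap binomial kernel) and
    subsample by 2 to build a Gaussian pyramid of subbands."""
    k = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    pyramid = [img.astype(np.float64)]
    for _ in range(levels - 1):
        top = pyramid[-1]
        # separable convolution: filter rows, then columns, edge-padded
        p = np.pad(top, 2, mode='edge')
        rows = sum(k[i] * p[:, i:i + top.shape[1]] for i in range(5))
        cols = sum(k[i] * rows[i:i + top.shape[0], :] for i in range(5))
        pyramid.append(cols[::2, ::2])   # subsample by 2 in each axis
    return pyramid
```

Each successive level halves the spatial sampling rate, so the levels act as the octave-spaced frequency subbands that the later contrast and masking stages operate on.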
4.2.1.1.3
The high dynamic range visual difference predictor (HDR-VDP) [61, 62] is an
extension of the VDP model to specifically handle comparing a LDR image with a
HDR image. Like the original VDP model the HDR-VDP model takes in two images,
85
a HDR image and a tone mapped LDR image, and outputs a visible differences map
indicating the likelihood that a distortion would be perceived. Also, like the VDP
metric the HDR-VDP models how the LDR and HDR image would appear on a
display to a viewer and it uses a simple model of the HVS to determine how the
two images would be perceived. The difference in the HDR-VDP is that it adds an optical transfer function (OTF) as the first stage of the HVS model, to model the scattering of light as it passes through the cornea, lens, and retina. This also allows the HDR and LDR image to be compared in a common space. Otherwise the same
elements of the HVS are modeled in both metrics, the amplitude compression, the
CSF of the HVS, the frequency selectivity of the HVS, and decreases in sensitivity
due to masking effects. However, the implementation of the amplitude compression
and CSF are changed in the HDR-VDP. In the HDR-VDP amplitude compression is
implemented in terms of JND units. In the original VDP JND units were used, but
not until after the CSF filters were used, so in the HDR-VDP conversion to JND no
longer happens in the CSF step.
Like the VDP metric the HDR-VDP has similar benefits and drawbacks. Since it
models so much of the HVS it is very bio-fidelic, but the visible difference map only
identifies locations of perceivable difference without indicating any further information
once the difference is past the perception threshold. An added drawback of the metric is that validation can only be done for HDR images whose dynamic ranges can be reproduced by available displays.

4.2.1.2 Top-Down Models
4.2.1.2.1 No-Reference Metric of Hautiere et al.
The top-down image quality assessment metric by Hautiere et al. [58] is an example of a no-reference QA metric. Though it is a no-reference metric and does not depend on the idea that an ideal image is available, it still takes two images as input and assumes one is the source and the other has been altered. The essence of the model is that the locations of all the edges in the altered image are identified and then the gradients of those edges are measured. In the original image the gradients at those same locations are also measured. The metric then measures the visible change in the gradients. From this a map can be created showing the locations of visible increases and decreases in edge gradients. The metric also has three types of single value descriptors: the increase in the number of newly visible edges (e, equation 4.2.1), the geometric mean of the ratio of edge gradients (r̄, equation 4.2.2), and the increase in the number of saturated pixels in the distorted image (σ, equation 4.2.3).
e = (n_r − n_o) / n_o   (4.2.1)

Where n_r and n_o are the number of visible edges in the altered image I_r and original image I_o, respectively.

The value r_i represents the change in edge gradients between the altered and original image:

r_i = ΔI_r / ΔI_o

where ΔI_r and ΔI_o are the gradients at the visible edges of the altered and original image, respectively. The geometric mean of these ratios is

r̄ = exp( (1/n_r) Σ_i log(r_i) )   (4.2.2)

σ = n_s / (w × h)   (4.2.3)

Where n_s is the number of saturated black or white pixels in the altered image that were not saturated in the original image, and w × h are the width and height of the image.
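The three descriptors can be sketched numerically. The following is a simplified illustration rather than the published implementation: a plain gradient-magnitude threshold stands in for the visible-edge detector, and `grad_thresh`, `sat_lo`, and `sat_hi` are illustrative parameters, not values from Hautiere et al.

```python
import numpy as np

def blind_contrast_descriptors(original, altered, grad_thresh=0.1,
                               sat_lo=0.02, sat_hi=0.98):
    """Sketch of the descriptors e, r-bar, and sigma for float images in [0, 1]."""
    # Gradient magnitudes (stand-in for the metric's visible-edge detector).
    gy_o, gx_o = np.gradient(original)
    gy_a, gx_a = np.gradient(altered)
    grad_o = np.hypot(gx_o, gy_o)
    grad_a = np.hypot(gx_a, gy_a)

    # "Visible" edges: gradient magnitude above a threshold.
    vis_o = grad_o > grad_thresh
    vis_a = grad_a > grad_thresh
    n_o, n_a = vis_o.sum(), vis_a.sum()

    # e: relative increase in the number of visible edges (eq. 4.2.1).
    e = (n_a - n_o) / n_o

    # r-bar: geometric mean of edge-gradient ratios at the altered image's
    # visible edges (eq. 4.2.2).
    eps = 1e-12
    ratios = grad_a[vis_a] / (grad_o[vis_a] + eps)
    r_bar = np.exp(np.mean(np.log(ratios + eps)))

    # sigma: fraction of newly saturated pixels (eq. 4.2.3).
    sat_a = (altered <= sat_lo) | (altered >= sat_hi)
    sat_o = (original <= sat_lo) | (original >= sat_hi)
    h, w = original.shape
    sigma = (sat_a & ~sat_o).sum() / (w * h)
    return e, r_bar, sigma
```

A contrast-stretched image, for example, should yield r̄ above one, since its edge gradients grow relative to the source.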
The no-reference quality assessment model by Hautiere et al. was developed to detect, measure, and rate the improvement of foggy images by different contrast enhancement methods. The benefits of the model are that it produces simple single value judgments about the improvement in the visual quality of images and that it can be used to evaluate different types of image enhancement methods, including tone mapping methods. It is also possible for the metric to create a map showing not only where contrast was enhanced or reduced, but also by how much, though only the r_i values can be used in the mapping function. One drawback of the model is also a feature, which is its simplicity. The metric is primarily based around the detection of edges and the strength of edge gradients, but that is not the only measurement of the quality of an image. Other measurements are the amount of noise in an image and the amount of image artifacts like JPEG compression artifacts, halos, reverse halos, and ghosting. Another drawback arises when the metric is used as a QA metric for tone mapped images. Because everything is done in ratios where one image is assumed to be the source, when comparing different tone mapped images against one another neither can serve as the source, and the original HDR image cannot be used because it lies in a different space.

4.2.1.2.2 Structural Similarity Index (SSIM)
The Structural Similarity Index (SSIM) [68] and the related Multi-Scale Structural Similarity Index (MS-SSIM) [70] are two of the most well-known and best performing full-reference quality assessment metrics. The SSIM metric assumes that all the computation in the HVS used when comparing two images is used to identify and locate structural distortions between the two images. The model then outputs a map indicating locations of identified structural distortions. To identify structural distortions, the SSIM metric computes three comparisons at each location in the two images: local differences in luminance, differences in contrast, and local structural differences. The product of the three local comparisons then gives the local SSIM at a particular location. The SSIM metric is only applied at a single scale, but for the MS-SSIM metric both input images undergo multiscale decomposition. In the multiscale decomposition both images are simply down-sampled and blurred. The regular SSIM metric is then applied at each scale and all the scales are multiplied together, with the necessary up-sampling or down-sampling needed to keep everything aligned. The single value descriptor of the metric is simply the mean of the SSIM map.
Luminance differences l(x, y) are calculated using:

l(x, y) = (2 μ_x μ_y + C_1) / (μ_x² + μ_y² + C_1)

Where μ_x and μ_y are the average intensity values in images x and y. C_1 is a constant designed to keep the luminance term stable when μ_x and μ_y are small.

Contrast differences c(x, y) are calculated using:

c(x, y) = (2 σ_x σ_y + C_2) / (σ_x² + σ_y² + C_2)

Where σ_x and σ_y are the standard deviations of the intensities in images x and y, respectively. C_2 is a constant to prevent the denominator from getting too close to zero.

The structural differences s(x, y) are calculated as follows:

s(x, y) = (σ_xy + C_3) / (σ_x σ_y + C_3)

Where σ_xy is the covariance of the intensities in x and y and C_3 is again a stabilizing constant.
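The three terms combine into a local SSIM map. A minimal numpy sketch follows; it uses a uniform window rather than the Gaussian window of the published implementation, folds c(x, y) and s(x, y) into one term by choosing C_3 = C_2/2, and the window size and constants are illustrative choices for images scaled to [0, 1].

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def ssim_map(x, y, win=7, C1=0.01**2, C2=0.03**2):
    """Local SSIM sketch: luminance term l(x, y) times the combined
    contrast/structure term c(x, y) * s(x, y) with C3 = C2 / 2."""
    def local_mean(a):
        # Mean over each win x win neighborhood (valid region only).
        return sliding_window_view(a, (win, win)).mean(axis=(-1, -2))

    mu_x, mu_y = local_mean(x), local_mean(y)
    var_x = local_mean(x * x) - mu_x**2          # local variance of x
    var_y = local_mean(y * y) - mu_y**2          # local variance of y
    cov_xy = local_mean(x * y) - mu_x * mu_y     # local covariance

    l = (2 * mu_x * mu_y + C1) / (mu_x**2 + mu_y**2 + C1)
    cs = (2 * cov_xy + C2) / (var_x + var_y + C2)
    return l * cs

def mssim(x, y):
    # Single value descriptor: the mean of the SSIM map.
    return ssim_map(x, y).mean()
```

Identical inputs give a map of ones; any distortion pulls the mean below one.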
The benefits of the SSIM metric are that the outputs of the model match well with human judgments on the amount of distortion in images with several types of distortions. Another benefit is that the distortion maps created by the SSIM metric do not just indicate locations of perceivable differences or JND; they also give ratios of how much distortion seems to be present. However, because no aspects of the HVS are directly modeled, such as contrast sensitivity when standing at a particular distance from a display, it is unclear how these results can help understand perception in specific situations. It is also outperformed in judgments of distortion by the Visual Information Fidelity metric [65], though it is the second best performing QA metric.
4.2.1.2.3 Visual Information Fidelity (VIF)

The visual information fidelity (VIF) [65, 66] metric is also one of the most well-known and best performing full-reference QA metrics for low dynamic range images.
The VIF metric is an information theoretic QA metric that measures the distortion between the original image and the altered image by calculating the Mutual Information between the ideal image, C, and the distorted image, D. The Mutual Information between two images is the amount of information in the ideal image that is also present in the distorted version of the image. The VIF metric assumes that the ideal image and the distorted image are related through a distortion channel, figure 4.1. It also assumes that when the ideal image and distorted image are viewed by the HVS, as E and F respectively, some information is lost. It models this loss by adding white noise sources N and N' to the ideal and distorted images, respectively. The VIF metric, like all the QA metrics listed before it, can create a VIF map identifying locations of distortions between the distorted and ideal image. It can also output a single VIF descriptor value, which is the ratio of the mutual information between the source image and the perception of the distorted image by the HVS against the mutual information between the source image and the perception of the source image by the HVS, equation 4.2.4.
Figure 4.1: System Diagram of the VIF metric. The natural scene C passes through a distortion channel to produce D; viewed through the HVS, each picks up a noise term, E = C + N and F = D + N'. Adapted from [65].

VIF = [ Σ_{j ∈ subbands} MI(C_j; F_j | s_j) ] / [ Σ_{j ∈ subbands} MI(C_j; E_j | s_j) ]   (4.2.4)
The VIF metric computes the Mutual Information in the wavelet domain of the two images. To compute the metric, first the source and distorted image are decomposed in the wavelet domain using the steerable pyramid filter. This creates multiple subbands for each image. A single subband of the original image is indicated by the symbol C.

For each block of pixels in a subband of the original image, a linear fit is calculated to match the corresponding block in the subband of the distorted image, equation 4.2.5. This is done for all blocks in a subband over all subbands. For both the distorted image and the original image a simple Gaussian white noise source is added to model the loss of information when an image is viewed by the HVS. Thereby the Mutual Information of the original image and the perception of that image by the HVS is simply the mutual information of the original image with itself where a Gaussian white noise source has been added, equation 4.2.6. The Mutual Information between the source image and the perception of the distorted image by the HVS is the mutual information between the source image and the source image multiplied by the linear fit to approximate the distorted image, with a Gaussian white noise source added, equation 4.2.7.

D = gC + V   (4.2.5)

MI(C, E) = I(C; E | s) = (1/2) Σ_{i=1}^{N} Σ_{k=1}^{M} log_2( 1 + (s_i² λ_k) / σ_n² )   (4.2.6)
MI(C, F) = I(C; F | s) = (1/2) Σ_{i=1}^{N} Σ_{k=1}^{M} log_2( 1 + (g_i² s_i² λ_k) / (σ_v² + σ_n²) )   (4.2.7)

Where M is the number of entries in each block of C and D, N is the number of blocks in a subband, such that N × M equals the number of pixels in a given subband, and the λ_k are the eigenvalues of the block covariance matrix.
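The per-subband computation can be sketched with a simplified scalar GSM: each block's sample variance stands in for the s_i²λ_k terms, and the block gain g and residual variance of V are estimated by least squares. The noise variance `sigma_n2 = 0.1` and the 3×3 block size are illustrative values, not the parameters of the published implementation.

```python
import numpy as np

def vif_subband(c, d, sigma_n2=0.1):
    """Sketch of the per-subband VIF terms (eqs. 4.2.5-4.2.7) for one
    subband of source coefficients c and distorted coefficients d."""
    M = 9                                   # entries per block (3x3)
    n = (c.size // M) * M
    C = c.ravel()[:n].reshape(-1, M)        # N blocks x M entries
    D = d.ravel()[:n].reshape(-1, M)

    # Eq. 4.2.5: least-squares gain g per block, D ~ g*C + V.
    g = (C * D).sum(1) / ((C * C).sum(1) + 1e-12)
    s2 = (C * C).sum(1) / M                 # per-block signal variance
    sv2 = ((D - g[:, None] * C)**2).sum(1) / M   # residual variance of V

    # Eq. 4.2.6: information in the reference as seen through the HVS noise.
    mi_ce = 0.5 * np.log2(1 + s2 / sigma_n2).sum()
    # Eq. 4.2.7: information about the reference surviving in the distortion.
    mi_cf = 0.5 * np.log2(1 + g**2 * s2 / (sv2 + sigma_n2)).sum()
    return mi_cf, mi_ce
```

The VIF value for the whole image is then the sum of the `mi_cf` terms over all subbands divided by the sum of the `mi_ce` terms, as in equation 4.2.4.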
The benefit of the VIF metric is that it has been shown to be the best quality assessment metric for all common distortion types: JPEG compression artifacts, Gaussian blur, and white noise [65, 66]. It also requires very little parameter tuning or calibration to work. The output of the model is usually in the range [0, 1]. The model outputs a 0 if there is no information about the ideal image in the distorted image and a 1 if the distorted image is exactly the ideal image. However, the model can output a number greater than one in certain cases. The model is biased to assume that the ideal image has a higher amount of mutual information between it and a noisier version of itself, but it is possible, if the distorted image has been enhanced, for the distorted image to have a higher mutual information value, resulting in a VIF > 1. Because the model is highly biased towards the ideal image, however, it is not guaranteed that a distorted image with more information will be identified as such. For instance, if an ideal image is compared against a distorted image it may have a VIF value < 1, saying the distorted image is distorted. However, if the distorted image is treated as ideal and the ideal image treated as distorted, it is not definite that the VIF value will be > 1, saying that the ideal image has more information. Also, while the VIF metric is more accurate, it is more computationally intensive than the next best metrics, SSIM and MS-SSIM, requiring about ten times as much computation to run [65].
4.3 Developing a Quality Assessment Metric for Tone Mapped Images
There are two approaches that can be taken when developing a QA model to compare the amount of visible detail in tone mapped images. One way is to approach the problem as a FR QA metric that compares the original HDR image with its LDR tone mapped version. This is the more natural approach, as the HDR image is the true original. The difficulty in this approach is that HDR and LDR images exist in two different spaces. The only way to compare them is to first put them into the same space, and the only known way to do this is by emulating the visual pathway, which is most likely why all HDR QA methods are bottom-up. Any other method is akin to a tone mapping algorithm, which is the very thing under test. Also, verification of any judgments an HDR QA metric makes is difficult: HDR images can have dynamic ranges that span over 14 orders of magnitude, while even the best high dynamic range television only spans 5 orders of magnitude, limiting which HDR images can be used in verification trials. The other approach is to use a NR QA method that simply compares two tone mapped images to judge which one has the greater amount of information. This is a simpler approach; since both images are already in the same space, no conversion or alteration of the images is needed. The drawback of this method is that, while we can define some measure of information content, there is no verification that the information present in one of the tone mapped images was actually present in the original HDR image.
Despite this issue we have chosen to develop a NR QA metric to evaluate the amount of information or detail in tone mapped images, for the sake of maximizing the amount of visible detail perceived by the HVS, and like other QA metrics our model operates solely in the luminance domain. We made this decision for several reasons. First, historically top-down methods tend to perform better as QA metrics than bottom-up approaches; while bottom-up metrics are more true to the biology, top-down approaches tend to get at the underlying goal of perception a little better. Secondly, a no-reference metric lets us directly compare different tone mapped images based on the amount of information or detail they contain. Thirdly, even though the HDR image is the original version, it is not necessarily the ideal image. It is possible to enhance details that are not visible, or not easily visible, in the HDR image so that they are visible in the LDR image. Treating the HDR image as the ideal would cause any enhancement of detail above what is present in the HDR image to be seen as an error or even as a loss of information. We do not want to limit a tone mapping algorithm by this constraint, and while it is possible for a tone mapped image to contain artifacts that are not in the original HDR image, we prefer to promote higher information content even though this may also create image artifacts.
4.3.1 Model Background
In the development of our model we take an approach similar to the VIF metric, but extend it to no-reference situations by not assuming that one of the two images is perfect and by measuring entropy instead of Mutual Information. Like the VIF metric, our model is based on statistical models of natural images. We model natural images in the wavelet domain using Gaussian scale mixtures (GSM). Using GSM models, the highly non-normal statistical properties of natural images can be preserved, while still allowing us to estimate the entropy in images using something like a Gaussian probability distribution.

4.3.1.1 Natural Images

When we speak of natural images we refer to images of real objects, like cars, trees, buildings, clouds, or mountains. Natural images form a distinct but tiny subset of the set of all possible types of images that can be represented, such as images of radar signals, sonar signals, text, random noise, etc., and they have certain statistical properties such as scale invariance, a 1/f spectral distribution, and stationarity [72, 73].
Figure 4.2: Empirical log histogram of wavelet coefficients of a natural image from a single subband of the wavelet decomposition of the image.
4.3.1.2 Natural Image Decomposition
Bell and Sejnowski [74] investigated the natural basis set for natural scenes and found that applying independent component analysis (ICA) to a large set of natural images produces spatially oriented, localized filters [75]. These filters resemble the simple cells of the visual cortex and suggest that modeling natural images may be equivalent to modeling the visual pathway [76]. Image decomposition using spatially localized oriented filters is analogous to wavelet decomposition, which is a popular method for image decomposition in image analysis and image compression. For either filter type, the distribution of coefficient outputs for a natural scene is highly kurtotic, zero-meaned, and heavy-tailed. The distribution of these coefficients can be modeled as a Gaussian scale mixture (GSM) [72].
4.3.1.3 Gaussian Scale Mixture Model

Under the GSM model, each coefficient vector D is the product of a zero mean Gaussian random vector V and the square root of an independent scalar random variable S, so that D = √S V. The distribution of D given S is then p(D | S) ~ N(0, S C_V), where C_V is the covariance matrix of V.
4.3.2 Quality Assessment Model

Since our model operates in the luminance domain, the first thing we do is to convert any color image to intensity only. Then we apply wavelet decomposition to each image, creating several subbands at different scales and orientations. For each subband we calculate the entropy for each coefficient in that subband. Finally we sum each entropy subband over all scales and orientations to get an estimate of the information content in each image. The difference in Information Content values between the images then identifies which image has the most information.
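The pipeline can be sketched end to end. This is a simplified stand-in rather than our actual implementation: a difference-of-box-blurs pyramid replaces the oriented wavelet decomposition, a histogram-based Shannon entropy replaces the GSM entropy estimate developed below, and the bin and scale counts are arbitrary.

```python
import numpy as np

def box_blur(a):
    # 3x3 average with edge padding, standing in for a Gaussian low-pass.
    p = np.pad(a, 1, mode='edge')
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def information_content(image, n_scales=3, n_bins=64):
    """Grayscale conversion -> multiscale band-pass subbands -> per-subband
    Shannon entropy -> summation over all scales."""
    if image.ndim == 3:
        image = image.mean(axis=2)        # convert color to intensity only
    total = 0.0
    current = image.astype(float)
    for _ in range(n_scales):
        low = box_blur(current)
        subband = current - low           # band-pass "subband"
        hist, _ = np.histogram(subband, bins=n_bins)
        p = hist / hist.sum()
        p = p[p > 0]
        total += -(p * np.log2(p)).sum()  # Shannon entropy of the subband
        current = low[::2, ::2]           # move to the next coarser scale
    return total
```

An image with fine detail at many scales yields a larger total than a smooth or flat image, which is the comparison the model performs between tone mapped candidates.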
Figure 4.3: Block Diagram of the Quality Assessment Model.
4.3.2.1 Image Decomposition

Gaussian derivatives are also capable of handling more complex orientation patterns such as corners, t-junctions, and even simple textures [77]. For the analysis done in section 4.3.3 we use 5th order Gaussian derivative filters at 6 different orientations. For 5th order Gaussian derivative filters, 6 equally spaced orientation directions provide a basis set of filters [77].
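Such a bank can be constructed analytically, since the n-th derivative of a Gaussian is a Hermite polynomial times the Gaussian. The sketch below is an illustration rather than our implementation; the kernel size and σ are arbitrary choices.

```python
import numpy as np

def gaussian_deriv5_bank(size=21, sigma=2.0, n_orient=6):
    """Build 5th-order Gaussian derivative filters at equally spaced
    orientations (6 orientations cover 180 degrees)."""
    half = size // 2
    yy, xx = np.mgrid[-half:half + 1, -half:half + 1]
    bank = []
    for k in range(n_orient):
        theta = k * np.pi / n_orient
        # Rotated coordinates in units of sigma.
        u = (xx * np.cos(theta) + yy * np.sin(theta)) / sigma
        v = (-xx * np.sin(theta) + yy * np.cos(theta)) / sigma
        # d^5/du^5 exp(-u^2/2) = -He5(u) exp(-u^2/2), with the
        # probabilists' Hermite polynomial He5(u) = u^5 - 10u^3 + 15u.
        he5 = u**5 - 10 * u**3 + 15 * u
        kern = -he5 * np.exp(-(u**2 + v**2) / 2)
        kern -= kern.mean()                # zero DC response
        bank.append(kern / np.abs(kern).sum())
    return bank
```

Convolving an image with each kernel in the bank yields the 6 oriented subbands at one scale; coarser scales follow by blurring and down-sampling the image first.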
4.3.2.2 Entropy Computation

After the image has been decomposed we calculate the entropy for each subband of the wavelet decomposition. We define a subband as the output of one oriented filter at a given scale. The entropy we calculate is the Shannon Entropy from Information Theory, which can be stated as:

H(D) = −Σ p(D) log(p(D))   (4.3.1)
Each coefficient vector D is modeled as a Gaussian scale mixture:

D = √S V   (4.3.2)

Where S is a scalar random variable and V ~ N(0, C_V) is a zero mean Gaussian random vector with covariance C_V. The entropy of the coefficients in a subband is then

H(D) = Σ_{i=1}^{N} H(D_i),   H(D_i) = −Σ p(D_i) log(p(D_i))   (4.3.3)

and

p(D_i) = ∫_0^∞ p(D_i | S) p(S) dS   (4.3.4)

Where p(D_i | S) ~ N(0, S C_V). There are no well-established models for p(S), but based on the results of work by Portilla et al. [78] we use the Jeffreys non-informative prior, which states that:

p(S) ∝ 1/S   (4.3.5)

Since this is not a proper distribution we must limit the integration to [s_min, ∞) and let p(S) = 0 when S < s_min. With this we are able to calculate the entropy of D for each subband of the wavelet decomposition. This produces an entropy map for each subband of the wavelet decomposition.
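For scalar (one-dimensional) coefficients, the mixture of equations 4.3.4 and 4.3.5 can be evaluated numerically. The sketch below is an illustration only: the model itself operates on vectors of coefficients, and `s_min`, `s_max`, and the grid sizes are arbitrary choices.

```python
import numpy as np

def gsm_coefficient_entropy(coeffs, s_min=0.1, s_max=10.0, n_s=50):
    """Entropy of a 1-D GSM fit to `coeffs`, using a discretized,
    truncated Jeffreys prior p(S) ~ 1/S on [s_min, s_max]."""
    s = np.geomspace(s_min, s_max, n_s)
    ps = 1.0 / s
    ps /= ps.sum()                          # discretized Jeffreys prior

    c_v = coeffs.var()                      # variance of the Gaussian V
    d = np.linspace(coeffs.min(), coeffs.max(), 512)
    dd = d[1] - d[0]

    # p(d) = sum_s N(d; 0, s * c_v) p(s), the mixture form of eq. 4.3.4.
    var = s[:, None] * c_v
    pdf = (np.exp(-d[None, :]**2 / (2 * var)) /
           np.sqrt(2 * np.pi * var) * ps[:, None]).sum(0)
    pdf /= (pdf * dd).sum()                 # renormalize after truncation

    # Entropy of the mixture (eq. 4.3.3 for a single coefficient).
    return -np.sum(pdf * np.log2(pdf + 1e-300) * dd)
```

As expected of an entropy estimate, a subband with larger coefficient spread scores higher than one with smaller spread.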
Mathematically all the equations presented above are correct, but several details in the math would prevent an actual implementation, or solving the previous equations would take an impractically long time.

Numerical Approximation

Technically the entropy calculation in equation 4.3.3 is correct. However, the probability distribution of D_i is analytically unsolvable. Numerically it is also impractical due to the curse of dimensionality. The smallest window for the vector D_i is usually 3×3, to keep the center of the window on a coefficient rather than in between coefficients. This results in a 9-dimensional Gaussian that must be integrated, equation 4.3.4. In order to get an answer in a more practical time frame we must approximate the entropy of D_i. First we approximate:

log(p(D_i)) = log( Σ_{s∈S} p(D_i | s) p(s) )   (4.3.8)
The entropy is then approximated by weighting the log term of the Gaussian at the smallest mixer value:

H(D_i) ≈ −Σ_{s∈S} Σ_{d_i∈D_i} p(d_i | s) p(s) log( N(d_i; 0, s_min C_V) )   (4.3.9)

For the multivariate Gaussian the log term has the closed form

−log( N(d_i; 0, s_min C_V) ) = (1/2) log( (2π)^M |s_min C_V| ) + (1/2) d_i^T (s_min C_V)^{−1} d_i   (4.3.10)

Substituting 4.3.10 into 4.3.9 gives

H(D_i) ≈ (1/2) Σ_{j=1}^{M} log( 2π s_min λ_j ) + (1/2) Σ_{s∈S} Σ_{d_i∈D_i} p(d_i | s) p(s) Σ_{j=1}^{M} d_ij² / (s_min λ_j)   (4.3.11)

In 4.3.11 we assume C_V = Q Λ Q^T, where Λ is the diagonal matrix of the eigenvalues λ_j of C_V and Q is an orthonormal matrix. We also define the elements of vector d_i, expressed in the eigenbasis of C_V, as d_i = {d_i1, d_i2, d_i3, ..., d_iM}, so that

log( (2π)^M |s_min C_V| ) = Σ_{j=1}^{M} log( 2π s_min λ_j )   (4.3.12)

Summing over all coefficients and subbands then gives the total information content

IC = Σ_{g∈subbands} Σ_i H(D_i)   (4.3.13)

H(D_i) ≈ (1/2) Σ_{j=1}^{M} [ log( 2π s_min,g λ_j ) + d_ij² / (s_min,g λ_j) ]   (4.3.14)

where s_min,g is the truncation value for subband g.
4.3.3 Model Testing

To test our QA model, we applied the model to all of the tone mapped images used in the second psychophysical experiment, described in chapter 3. This allows us to compare our model's evaluation of which tone mapped image had the greatest amount of information against the subjective data collected from the second psychophysical study. For the sake of comparing the output of the model to the human data, we collapse all of the entropy information to give a single value descriptor of the amount of information contained in each image. The single value measurement is computed through a simple summation of all the entropy coefficients at all scales and over all feature channels. The results of this test are tabulated in table 4.1. The image names correspond to the names used in figures 3.16 and 3.20.
4.3.3.1 Initial Tests

The results of the psychophysical experiments from chapter 3 found that the automated version of the bilateral filter algorithm was the most preferred tone mapping method. Similarly, the results in table 4.1 show that our information content quality assessment model also identifies the automated bilateral filtering algorithm as the algorithm with the highest amount of visual information after tone mapping, for almost all images.
Table 4.1: Quality Assessment entropy measure (×10^9) of the images used in the second psychophysical experiment, for each tone mapping method (Auto Bilateral, Bilateral Filter, Photographic Tone Mapping, Histogram Adjustment, Log Adaption, Retinex) and each source image (Belg, Seriel, Nancy_church2, Centersportif, Moto, Variphrase, Colorcube, Chantal-run, Brookhouse, Garage2).
Figure 4.4: Average rank order for tested tone mapping algorithms.
In order to judge the relative performance of each algorithm as judged by our model, we compute the rank of each tone mapping method for a given source image. A 6 is given to the tone mapping method with the highest estimated entropy and a 1 to the method with the lowest entropy, for a given image. We then compute the average rank for each tone mapping method over all images; the results are shown in figure 4.4. From the average rankings the model identifies the photographic tone reproduction operator as having the next highest entropy, while both the Retinex and the original bilateral filter algorithm are identified as having the least total entropy.
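The ranking step above can be sketched as follows, assuming the entropy values are arranged in an images × methods array (the array name and shape are illustrative):

```python
import numpy as np

def average_ranks(entropy_table):
    """Rank the tone mapping methods for each image (highest entropy gets
    the largest rank, lowest gets 1), then average the ranks over images."""
    # argsort of argsort gives 0-based ascending ranks within each row.
    ranks = entropy_table.argsort(axis=1).argsort(axis=1) + 1
    return ranks.mean(axis=0)
```

With six methods, a method that always had the highest entropy would average 6.0 and one that always had the lowest would average 1.0.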
4.3.3.2 Looking Forward

For future tests we will more deeply investigate how our model judges tone mapped images at different scales and orientations, to see if these feature subsets or combinations of these subsets correlate better with the recorded psychophysical data. A learning algorithm used to determine the weights of different scales and orientations may improve our results and shed some light on the preference for certain scales or orientations when judging the amount of perceived detail in an image.
Chapter 5

Ideal Observer Model

5.1 Introduction
The fovea occupies a small region of the field of view of our eyes. In order to see everything in our environment at the highest resolution of our eyesight, we must view it with our fovea. To do this we must move our eyes, in saccades, over the entire scene. We do this constantly to look for and process new information. However, we do not move our eyes randomly, nor do our eyes move in some predefined pattern independent of the environment we are in. Our eye movement is strongly affected by the environment, but it can be difficult to predict exactly where a person's eyes will look. There are many theories that try to explain the factors that influence eye motion, more specifically human visual fixation [29, 79-96].
One set of theories comes out of information theory and says that eye motion is based on information maximization [94, 95]. A person's eyes fixate on different locations in a scene in order to gain the maximum amount of information from that scene with the least amount of eye movement. In these theories, information is quantified as either self-information or entropy. This idea is very similar to the information content quality assessment model of the previous chapter, where detail was quantified as entropy. Thereby, detail and information may be considered equivalent in terms of our model. Our model, which already estimates the amount of information at each location in an image, may then also be used to predict fixation. This serves as a way to further validate our entropy based approach as a way to model human perception. Thus, entropy may not only serve as a way to predict the judgments people make about the amount of visible information in an image; it may also serve as a way to predict fixation. In this chapter we show how our entropy based model performs when applied to the problem of predicting visual fixation.
Fixation occurs when a person gazes at a point or a location for a given period of time. Locations with unique differences from their surroundings are more likely to grab the attention of the viewer first, as evidenced by the location of fixation. In more complex scenes, where multiple regions have features that differ from their surroundings, it is less clear what features or cues affect the probability that a person will look at one location over another. Understanding the features in an image that attract fixation therefore helps in predicting where a person will look.

5.2 Saliency Models

The influences on fixation can be split into top-down and bottom-up factors. Top-down influences are cognitive: if a person is searching an image for a green ball they may ignore any regions without green. Top-down processing is also needed for object recognition and classification and can help in some aspects of segmentation [79, 82, 83]. Bottom-up influences are
more sensory-based factors that affect fixation, such as spatial or temporal changes in contrast, texture, intensity, or color, like a red rose in a field of white roses. Top-down influences generally require some sort of learning or training that requires the use of large databases of positive and negative example objects or patterns to work. We would prefer to keep our model simple and to keep it idealized, without using specific examples of real things. So our saliency model includes no top-down aspects, and we limit any further discussion and comparisons to models that only use bottom-up influences.
5.2.1 Biologically Inspired Models
Most bottom-up models of fixation take the natural approach to modeling fixation, by basing some or all of the components in their models on the neural circuitry involved in fixation [29, 84-88]. The theory of how eye fixation works, on which these models are based, comes from a paper by Treisman and Gelade [89] and another paper by Koch and Ullman [90]. Treisman and Gelade proposed that the extraction of features from a scene or image occurs in a parallel fashion, where all the features in an image are extracted in one go, while attention is a serial process and must jump from one region in the scene to another. Koch and Ullman expanded upon this theory by proposing the idea that the movement of attention from one part of a scene to another occurs as a winner-take-all (WTA) process, and they first coined the term saliency map. Treisman and Gelade [89] initially proposed several features to be extracted, however most biologically inspired bottom-up saliency models extract only three types of features, intensity, color, and orientation, at multiple spatial scales [27, 29, 85, 87]. Other features that have been extracted along with the standard three are faces [88] and local entropy [86].
The difference between the various models is usually defined by how the features are processed after extraction and how the feature maps are combined into the final saliency map. Itti and Koch [29, 85] pass each feature map through a center-surround filter at all spatial scales; they normalize each filtered map, then average across scale, and then average across feature to create a single saliency map. The model by Itti, Koch, and Niebur [29], often referred to simply as the Itti model, has become the de facto standard model of fixation, against which all subsequent saliency models compare their results. The model by Zhao and Koch [88] uses across-scale averaging and a weighted average across features, where the value of the weights was learned in order to maximize the correlation between the generated saliency map and human fixation data. Lin et al. [86] apply a difference of Gaussians (DoG) center-surround filter to each feature channel and then compute the earth mover's distance (EMD), using weighted histograms, to compute the difference between the center and surround. A feature subset is then created by combining different feature channels using Minkowski summation [100, 101]; these feature subsets are then fed
5.2.2 Information Theoretic Models
Another approach that has been used to develop saliency models is to look at fixation from an information theoretic perspective. The underlying idea of most of these models is that when a person looks at a new image or scene, they want to learn as much about that scene or image as quickly as possible. So the locations that are fixated on first are the ones that have the most information. In these models information is typically measured as entropy [91-93] or self-information [94-96]. These concepts are often confused with one another, and while they are related they have distinct differences. Entropy is a measure of the uncertainty in the outcome of a random variable, while self-information is a measure of the likelihood of the outcome of a random variable. The more uncertain or unpredictable the outcome of a random variable is, the higher the entropy. As it applies to images, the higher the uncertainty of the exact value of a pixel or region of pixels, the higher the entropy, and thereby the more information is gained when the value of that region is known. Meanwhile, the lower the likelihood that a pixel or region of pixels has a certain value, the higher the self-information when that value is known.

Both measures are related concepts, since entropy is defined as the expected value of the self-information. The confusion arises from the fact that the mutual information of a random variable with itself can also be called the self-information of that variable, but this definition is also equivalent to the entropy of that variable. This is shown in equation 5.2.1:
I(X; X) = H(X) − H(X | X) = H(X)   (5.2.1)
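A small numerical check of the underlying relationship, that entropy is the expected self-information, for an arbitrary discrete distribution (the probabilities here are illustrative):

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])   # an arbitrary pmf
self_info = -np.log2(p)                   # self-information of each outcome
entropy = (p * self_info).sum()           # entropy = expected self-information

# 0.5*1 + 0.25*2 + 0.125*3 + 0.125*3 = 1.75 bits
assert np.isclose(entropy, 1.75)
```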
The main source of differences between saliency models based in information theory is how they create their feature maps and how they estimate the probability density function (pdf) of the variable over which they are computing entropy or self-information. Models that estimate salience through entropy [91, 92] usually calculate the local entropy of image patches in the intensity domain. Kadir and Brady [91] estimate the probability distribution function of the intensity values using the histogram of local image patches. It is possible to measure entropy solely in the intensity domain without extracting features from the image. However, transforming the image into the wavelet or spectral domain can offer a richer set of values, and unique coefficients are more closely related to saliency in these domains than in the intensity domain. Kadir and Brady also do not implement a multi-scale decomposition, but instead determine the best scale for each pixel. This prevents the design from being parallelized, and in order to determine the best scale, the pdf from the histogram must be calculated for multiple scales until the optimal scale is found. Tamayo and Traver [92] implement the model by Kadir and Brady, but they first convert the source image into log-polar
computationally intensive as the size of the network grows. So networks must usually
be kept small limiting the resolution of features that the model can pick up.
Bruce and Tsotsos [94,95] compute the probability of each coefficient output by
one of their basis filters, by taking the histogram of the surrounding coefficients output
by that same filter. This means that they must compute the pdf for every coefficient
output by each of the 54 image filters in their basis set. However, using
principal component analysis they know that using all 54 image filters allows them
to retain 95% of the variance, ensuring that they detect almost all of the possibly relevant
features from every image. Zhang et al., [96] precompute both the basis set of image
filters and the probability distribution for each basis filter. However, they use more
basis filters than both of the previous models combined: 396.
Not all information theoretic models assume that all the information in an image
is of equal value. Gao and Vasconcelos [79] developed an information theoretic model
of visual salience that proposes that a specific type of saliency map can be used as
a preprocessing step for object recognition. In their model a saliency map is a 2-D
map of the likelihood that each location, based on the local features present, contains
a particular object or class of interest. This is a different interpretation of salience
than what most models use and to show this difference they call their interpretation
of saliency discriminant saliency. The discriminant saliency of a location is then the
ability to discriminant an object or class of interest from the background, defined as
the set of all other known classes or objects, at that location. In their model, at each
location in an image, they estimate the likelihood that each type of feature response is
a discriminating indicator that the object or class of interest is present by calculating
the mutual information between each feature response and the class of interest. This
model is essentially a top-down model, but the basic framework has also been applied
and use a Bayesian definition of surprise to generate a saliency map for a given image.
The model by Itti and Baldi is much like the Itti model, in that it extracts orientation,
color, and intensity, but it also extracts temporal effects like flicker and motion. From
the filter responses, it then estimates surprising information, in both space and time,
using the Kullback-Leibler divergence. It is a simple model that extends the Itti
model to video and it has good performance, but it is unclear, from the paper, how
the model compares with other saliency models using only still images.
5.3 Saliency Model Diagram
Figure 5.1: Example of the Saliency Model applied to a natural image. The Spatial
Filters show sample entropy maps from different feature channels and orientations.
Our saliency map is color coded and is superimposed upon a black and white version
of the original source image, and only shows the most salient locations. The red
regions have highest saliency and as the color goes from cyan to, dark blue to light
blue the salience decreases. Source image from the MIT database [28].
implement the steerable pyramid architecture. We then calculate the entropy for
each filter output. However, as a saliency model, the primary concern is identifying
the regions with the most information. To better separate these locations
from the background, the saliency model normalizes each entropy map and does a
weighted summation of the normalized entropy maps to create a single map that
serves as the saliency map of the model, figures 5.1 and 5.2. We can also create
enumerated saliency maps, where the top regions are numbered based on their level
of salience, figure 5.3.
[Figure 5.2: block diagram of the model: Image → Entropy Maps → Weighted Summation → Salient Attention Map]
5.3.1 Normalization
For the normalization we take the approach given by Itti et al. [29]. We normalize
the entropy map of each subband to a given range [0, X] and multiply each entropy
map by (X − ⟨locmax⟩)², where ⟨locmax⟩ is the mean of all the local maxima
for that entropy map and we set X to be 100. We then bring all the normalized
entropy maps for each channel to the same scale, average the result across scales,
and do a weighted average across channels to create a single saliency map.
The weights in the weighted average are chosen so that the intensity channel has weight
equal to the combination of the color opponency channels, figure 5.2.
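The normalization step can be sketched as follows. This follows the Itti et al. [29] operator as described above; the strict 8-neighbour local-maximum test and the exclusion of the global maximum from ⟨locmax⟩ are implementation choices made here for illustration.

```python
import numpy as np

def normalize_map(m, X=100.0):
    """Itti-style normalization: rescale the map to [0, X], then multiply by
    (X - <locmax>)**2, where <locmax> is the mean of the local maxima other
    than the global maximum.  Maps with one dominant peak are promoted;
    maps with many comparable peaks are suppressed."""
    m = np.asarray(m, dtype=float)
    m = X * (m - m.min()) / (m.max() - m.min() + 1e-12)
    # local maxima: interior pixels strictly greater than all 8 neighbours
    c = m[1:-1, 1:-1]
    neigh = np.stack([m[:-2, 1:-1], m[2:, 1:-1], m[1:-1, :-2], m[1:-1, 2:],
                      m[:-2, :-2], m[:-2, 2:], m[2:, :-2], m[2:, 2:]])
    peaks = c[np.all(c > neigh, axis=0)]
    peaks = peaks[peaks < m.max() - 1e-9]    # exclude the global maximum
    locmax_mean = peaks.mean() if peaks.size else 0.0
    return m * (X - locmax_mean) ** 2

single = np.zeros((9, 9)); single[4, 4] = 1.0
multi = np.zeros((9, 9)); multi[2, 2], multi[6, 6], multi[2, 6] = 1.0, 0.9, 0.95
boosted = normalize_map(single)   # one peak: strongly promoted
damped = normalize_map(multi)     # several comparable peaks: suppressed
```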
Saliency maps are used as a way to identify the most salient regions in an image.
They are usually used to understand why we may fixate at certain locations before
others or where people on average are most likely to look. In simple scenes one
region will have a much higher saliency than all other regions in the image, but
in complex scenes it is not always clear which is the most salient region versus the
second most salient region, and so on. To clarify this, our model also enumerates the
top most salient regions, figure 5.3. This is similar to saliency models that implement
a saccading function that mimics the change in fixation of human observers [29,85].
To prevent labeling the same region multiple times, after a pixel has been selected the
saliency map is multiplied by an upside-down Gaussian mask that ranges over [0, 1]
and is centered at the selected pixel, figure 5.2. The selected pixel is thereby reduced
Figure 5.3: Demonstration of our saliency model numbering the top 10 most salient
locations. For these synthetic images the model identifies the most salient locations
as those where the features in that region differ from the global average: orientation
in row 1 and size in rows 2 and 3. For these images our region is in the shape of a
Gaussian with σ = 17 pixels, where the original image was 1024×768.
to zero and all nearby pixels are reduced to something less than their original value. This
enumeration function is similar to the winner-take-all networks used to implement
selective attention in many biologically inspired saliency models [29,85,87], though
it does not try to increase the likelihood that the next location of attention is close
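The enumeration with its inhibition-of-return mask can be sketched as below; the Gaussian width and the number of picks are arbitrary choices for illustration, not the model's actual parameters.

```python
import numpy as np

def enumerate_salient(smap, n=3, sigma=5.0):
    """Enumerate the top-n salient locations with inhibition of return: after
    each pick, multiply the map by an upside-down Gaussian in [0, 1] centred
    on the pick, so the winner drops to zero and its neighbourhood is
    suppressed, preventing the same region from being labelled twice."""
    smap = np.array(smap, dtype=float)
    yy, xx = np.mgrid[0:smap.shape[0], 0:smap.shape[1]]
    picks = []
    for _ in range(n):
        y, x = np.unravel_index(np.argmax(smap), smap.shape)
        picks.append((int(y), int(x)))
        mask = 1.0 - np.exp(-((yy - y) ** 2 + (xx - x) ** 2) / (2.0 * sigma ** 2))
        smap *= mask
    return picks

s = np.zeros((32, 32)); s[5, 5], s[20, 25] = 2.0, 1.0
picks = enumerate_salient(s, n=2)   # [(5, 5), (20, 25)]
```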
5.4 Evaluation
5.4.1 Image and Fixation Database
5.4.2
Besides just using human fixation data to evaluate the saliency maps generated
by our model, we also compare the output of our model with the output of two other
saliency models. The two models we used were the saliency model proposed by Itti,
Koch, and Niebur [29], generally referred to as the Itti model, and the graph based
visual saliency model (GBVS) [30]. The Itti model has become a standard saliency
model that other saliency models compare their fixation predictions against.
The GBVS model has the highest performance that we are aware of among
bottom-up, feed-forward saliency models that require no learning or training during
operation or as a preprocessing step. Since our model is also a bottom-up, feed-forward
saliency model that uses no learning or training, the GBVS model is an ideal model
to compare the performance of our model against.
The Itti model expands upon previous ideas of how fixation and attention operate
[89,90] and implements a computational model of fixation. First, the source image
is filtered, to extract orientation, intensity, and color at multiple spatial scales, to
create several feature maps. The feature maps are then filtered, using center-surround
filters, to identify changes in the different feature maps. Finally, each filtered m ap
is normalized, by multiplying each map by ( X ((localmax)))2, where X is the
maximum value of the map and localmax is the value of any locally maximum points in
the map. This emphasizes any filtered map that has a low number of local maximums,
while deemphasizing maps th at have a large number of local peaks. All normalized
maps are then averaged together to give a single saliency map.
The GBVS model proposes a new solution to the problem of isolating unique features in feature maps and combining feature maps into a single saliency map. The
GBVS model accomplishes this using Markov chains and graph theory. Like other
saliency models, the GBVS model first extracts basic features like color, orientation,
and intensity at multiple scales, but rather than using center-surround filters to isolate
changes in the feature maps, the GBVS model establishes a bidirectional graph
from every pixel in a feature map to every other pixel in the map of that feature
across all scales. The weight associated with each edge from one pixel to the next is
related to the difference between the pixel values and the distance between the pixels.
The weights are then normalized and used as transition probabilities in a Markov
chain. In the equilibrium condition, the mass at each location of the activation map
indicates its dissimilarity from its neighbors. They repeat this process for the
normalization step, but the weight at each edge is now related to the activation value of
the destination node and the distance between that node and the source node.
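A toy version of the GBVS equilibrium computation is sketched below. This is our own single-scale illustration, not the GBVS implementation: the weight function, σ, and the lazy-chain damping used to guarantee convergence of the power iteration are illustrative choices, and the real model operates on multi-scale feature maps.

```python
import numpy as np

def gbvs_activation(feature, sigma=2.0, iters=300):
    """GBVS-style activation, sketched: weight the edge between every ordered
    pair of pixels by their value dissimilarity times a Gaussian falloff with
    distance, row-normalise the weights into transition probabilities, and
    take the stationary distribution of the (lazy) Markov chain as the
    activation map."""
    h, w = feature.shape
    vals = feature.ravel()
    ys, xs = np.divmod(np.arange(h * w), w)          # pixel coordinates
    dval = np.abs(vals[:, None] - vals[None, :])     # value dissimilarity
    dist2 = (ys[:, None] - ys[None, :]) ** 2 + (xs[:, None] - xs[None, :]) ** 2
    wgt = dval * np.exp(-dist2 / (2.0 * sigma ** 2))
    np.fill_diagonal(wgt, 0.0)
    row = wgt / (wgt.sum(axis=1, keepdims=True) + 1e-12)
    P = 0.5 * (np.eye(h * w) + row)   # lazy chain so the power iteration converges
    pi = np.full(h * w, 1.0 / (h * w))
    for _ in range(iters):
        pi = pi @ P
        pi /= pi.sum()
    return pi.reshape(h, w)

# A single odd pixel in an otherwise flat map collects the most mass.
f = np.zeros((8, 8)); f[3, 4] = 1.0
act = gbvs_activation(f)
```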
We obtained both models from the Koch lab website http://www.klab.caltech.edu/~harel/share/gbvs.php as part of the GBVS source code. In order to get a
fair comparison between our model, the Itti model, and the GBVS model, we ensured
that the various settings and parameters of the Itti and GBVS models were kept
consistent with our own. For all of the models we operated at four spatial levels,
to ensure there were no differences in the multiscale decomposition. The 5th order
Gaussian derivative filters used in our model result in spatial frequency selection at
the following orientations [0, 30, 60, 90, 120, and 150 degrees], and so for the Itti and GBVS
models we also extract these orientations, using Gabor filters. We also removed any
blur added to the final saliency map.
The ideal result of using any saliency model is a perfect prediction of human
fixation. However, due to top-down influences, the varying biases people have, and
the general differences between individuals when they view an image; the goal of
perfect accuracy is virtually impossible, without detailed knowledge of the individuals
viewing those images. We thereby assume that the best any saliency model can
do is predict fixation locations with the same accuracy as the average inter-subject
predictability [30]. The inter-subject predictability is the level of accuracy with which the
fixation locations of one subject can be predicted using the fixation locations of all the
other subjects viewing that same image. We compute the inter-subject predictability
by blurring the fixation locations of all the subjects, save one, with a Gaussian kernel.
This is treated as a saliency map and run through one of the similarity measures
defined in the next section. The ground truth for the similarity measure is the fixation
points from the remaining subject. We then repeat this for all subjects and the
average result when predicting all the different subjects is treated as the best possible
value. In the following results all values are normalized by the result of the average
inter-subject predictability for each similarity measure.
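The leave-one-out procedure can be sketched as follows; the blur width, the map size, and the simple "mean saliency at fixations" metric used here are placeholders for illustration, not the dissertation's actual parameters.

```python
import numpy as np

def gaussian_blur(img, sigma=4.0):
    """Separable Gaussian blur (numpy only)."""
    r = int(3 * sigma)
    x = np.arange(-r, r + 1)
    k = np.exp(-x ** 2 / (2.0 * sigma ** 2))
    k /= k.sum()
    out = np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 0, img)
    return np.apply_along_axis(lambda c: np.convolve(c, k, mode='same'), 1, out)

def intersubject_predictability(fix_per_subject, shape, metric, sigma=4.0):
    """Leave-one-out inter-subject predictability, sketched: blur the fixation
    points of all subjects save one into a map, treat it as a saliency map,
    score it against the held-out subject's fixations, and average."""
    scores = []
    for held in range(len(fix_per_subject)):
        m = np.zeros(shape)
        for s, fixes in enumerate(fix_per_subject):
            if s != held:
                for (y, x) in fixes:
                    m[y, x] += 1.0
        scores.append(metric(gaussian_blur(m, sigma), fix_per_subject[held]))
    return float(np.mean(scores))

# Illustrative data: three subjects who agree vs. three who disagree.
metric = lambda sal, fixes: float(np.mean([sal[y, x] for y, x in fixes]))
concentrated = [[(10, 10)], [(11, 10)], [(10, 11)]]
scattered = [[(2, 2)], [(25, 3)], [(5, 28)]]
hi = intersubject_predictability(concentrated, (32, 32), metric)
lo = intersubject_predictability(scattered, (32, 32), metric)
```

Subjects who agree on where to look predict each other well (high score); subjects fixating unrelated locations do not.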
5.4.3 Metrics
Numerous similarity metrics have been proposed in order to judge how well different
saliency maps can predict the fixation locations of human test subjects. The
most widely used similarity measure is the Receiver Operating Characteristic (ROC) curve [103].
ROC curves are graphs of the true positive rate vs. the false positive rate of how well
a saliency map predicts human fixation locations. To generate an ROC curve the
saliency map is viewed as a binary classifier. Any locations in the saliency map above
a given threshold are treated as fixation points and all locations below threshold are
treated as non-fixation points. The fixation locations recorded from human trials are
treated as ground truth fixation locations. By varying the value of the threshold, the
ratio of true positives to false positives changes, creating the ROC curve. Generally,
rather than showing the ROC curve the area under the curve (AUC) is measured to
use a single value as a measure of the performance of the saliency model. ROC curves
have the benefit of transformation invariance [96], but ROC curves only depend on
the rank order of the values in the saliency map and not the absolute or even relative
values of the saliency map [88]. It has been suggested that multiple measures should
be used when comparing the results of different saliency models [88]. The normalized
scanpath salience (NSS) [27], the Kullback-Leibler divergence [102], correlation
metrics [86,104], and the EMD [105] are some of the other popular similarity
measures that have been proposed for evaluating the
determine the NSS, the saliency map is linearly normalized to have zero mean and
unit standard deviation. The values of the normalized saliency map at the ground
truth fixation points are then averaged. The larger the NSS value, the more likely
that the saliency values are correlated with the ground truth fixation points.
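The NSS computation just described can be sketched directly (the toy map and fixation points below are illustrative):

```python
import numpy as np

def nss(smap, fixations):
    """Normalized scanpath salience: z-score the saliency map (zero mean,
    unit standard deviation), then average the z-values at the ground-truth
    fixation points."""
    smap = np.asarray(smap, dtype=float)
    z = (smap - smap.mean()) / (smap.std() + 1e-12)
    return float(np.mean([z[y, x] for y, x in fixations]))

smap = np.zeros((5, 5)); smap[2, 2] = 1.0
hit = nss(smap, [(2, 2)])    # fixation on the peak: strongly positive
miss = nss(smap, [(0, 0)])   # fixation off the peak: slightly negative
```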
5.4.4 Center bias
The main drawback of all of these methods, as they were originally proposed,
is their sensitivity to central bias. It has been well documented that fixation data
contains a strong central bias. There have been many proposed causes of this central
bias (photographers generally take pictures with the object/subject of interest in the
center, psychological tests usually have subjects focus on the center of the image in
between trials, and the center of the image is an efficient location to start from between
images and to minimize eye travel time) [104,106,107]. Whatever the reason for
the presence of this bias, it is still regularly found in the data, and to compensate
for this, many models add a central bias to the output of their saliency maps, to
better match the fixation data. For our model and the Itti model, we needed to add
a central bias to the saliency map output. We did this by multiplying the original
saliency map with a Gaussian kernel the size of the image. The GBVS model has an
inherent central bias and so needed no modification.
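The added center bias can be sketched as a separable Gaussian weighting of the map; the width, set by sigma_frac, is an illustrative choice, not the value used in the experiments.

```python
import numpy as np

def add_center_bias(smap, sigma_frac=0.25):
    """Multiply a saliency map by a centred Gaussian the size of the image;
    sigma_frac sets the Gaussian width as a fraction of each image dimension."""
    h, w = smap.shape
    y = np.arange(h) - (h - 1) / 2.0
    x = np.arange(w) - (w - 1) / 2.0
    gy = np.exp(-y ** 2 / (2.0 * (sigma_frac * h) ** 2))
    gx = np.exp(-x ** 2 / (2.0 * (sigma_frac * w) ** 2))
    return smap * np.outer(gy, gx)

biased = add_center_bias(np.ones((9, 9)))  # uniform map: centre weighted up
```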
However, the real issue with center bias and the similarity measures as they were
originally proposed is that simply treating a Gaussian kernel centered in the middle
of the image as a saliency map has been found to produce ROC AUC values of 0.80,
which is much higher than the reported values of many well-known saliency models [96];
this is also true for the NSS measure.
To compensate for this, alterations to the ROC measure have been previously
proposed [96]. The alternative formulation of the ROC curve is to use only the values
of the saliency map that fall on ground truth fixation points for the current image and
the ground truth fixation points of a different image as part of the binary classifier.
We have also found that using all the ground truth fixation points from all the images,
without any repeated points, works just as well in removing the central bias sensitivity,
and this is the formulation we use in the results section. This method also increases
the total number of sample points used in the ROC calculation, from less than two
hundred points to over a thousand, which we feel may help in reducing any statistical
errors.
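The shuffled (center-bias-compensated) ROC AUC described above can be sketched as follows, using the Mann-Whitney formulation of AUC; the small map and the pooled fixation set below are illustrative stand-ins for the all-images ground truth.

```python
import numpy as np

def shuffled_auc(smap, fixations, all_fixations):
    """Center-bias-compensated ROC AUC, sketched: positives are the saliency
    values at the current image's fixation points; negatives are the values
    at the pooled fixation points from the other images (duplicates removed).
    The AUC is the probability that a random positive outranks a random
    negative, with ties counting half (the Mann-Whitney formulation)."""
    pos = np.array([smap[y, x] for y, x in fixations], dtype=float)
    neg_pts = set(all_fixations) - set(fixations)
    neg = np.array([smap[y, x] for y, x in neg_pts], dtype=float)
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return float(gt + 0.5 * eq)

smap = np.arange(25, dtype=float).reshape(5, 5) / 24.0
pooled = [(4, 4), (0, 0), (2, 2)]
good = shuffled_auc(smap, [(4, 4)], pooled)  # fixation on the top value
bad = shuffled_auc(smap, [(0, 0)], pooled)   # fixation on the bottom value
```

Because the negatives are drawn from real fixation locations (themselves center-biased), a pure centered Gaussian map no longer scores well above chance.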
For the NSS measure there is no previously proposed alteration, so we propose
that to compensate for any center bias issues, the normalization of the saliency map
should only be done using saliency values that fall on the set of all ground truth
locations from all the viewed images; again making sure no points are used twice.
The saliency values of the ground truth fixation points for the current image are then
averaged.
The drawback of compensating for center bias is that the central bias in fixation
data is real, and by subtracting the effect of that bias from the data it has been
suggested that the metrics actually overcompensate, or remove too much central bias
[96,108]. For this reason we use both types of similarity measures to judge the
saliency maps generated by our model, the Itti model, and the GBVS model. The
GBVS model, which already has an inherent central bias, can also be run with that
inherent central bias removed, which is what we do in these central bias compensated
(CBC) metrics.
5.5 Results
We have applied our model to numerous natural images and several synthetic
images and qualitatively it has promising results. For the simplest types of scenes, as
in figure 5.3, our saliency model chooses the regions that differ from the surrounding
area as the most salient, as one would expect, even though these images are artificial
and are not guaranteed to have the same properties as natural images. The location
of regions of high salience tends to fall on object edges and boundaries, since our
model uses only local features and does not try to move the point of fixation towards
the center of objects, although it still identifies features of the most salient object.
The saliency maps generated by our saliency model on more complex images are
shown in figure 5.4, along with the saliency maps generated using the GBVS and Itti
models for comparison.
The quantitative evaluation of our model is done by comparing the human fixation
locations against the saliency maps generated by our model and the saliency maps
generated by the other comparative models, all of which is summarized in table 5.1.
We note that with the NSS metric, our model consistently outperforms both the Itti
and GBVS models, regardless of whether Central Bias Compensation (CBC) is used.
With the ROC metric however, our model has the best results with CBC, but without
it our model is only better than the Gaussian Kernel. The likely explanation of this is
that our model does not shy away from the edges of images. In our model features that
uniquely differ from their surroundings may become salient regardless of the proximity
of that feature to the edge of the image. This is demonstrated in our saliency map
of the cat image, where the edge of the tile floor is marked as a potentially salient
feature due to its difference from its surroundings. Human fixation data, on the other
hand, has a well-known center bias when looking at pictures [104,106,107], thus any
Figure 5.4: Qualitative comparison of saliency maps from different saliency models.
The 1st column, the leftmost column, shows the original images from the MIT
database [28] with fixation points from all subjects superimposed as red dots.
The 2nd column is the saliency map created using our saliency model, where the hotter
locations are indicators of high salience. The 3rd column is saliency maps from
the Itti model [29] and the 4th column is saliency maps generated using the GBVS
model [30].
regions near the edge that are identified as salient are more likely to be classified as
false positives in the ROC metric. This aspect of our model, while it does result
in a lower score, may not be an actual flaw of the model. Saliency models are not
only meant to predict where people fixate when they look at a picture but where they
fixate when they see the world. In the world there are no hard boundaries, nor is the
most salient feature in the current field of view likely to start in the center of your field
of view. A feature that is unique from its surroundings may have high information
content regardless of its location in the field of view. For the ROC metric with CBC,
the treatment of features near the boundary of the image does not matter, since the
locations used in the metric are only those locations where a person has fixated in
one of the images; these locations are unlikely to be near the edge of the image. In
spite of this higher false positive rate, our model still has the best performance using
the NSS metric whether CBC is used or not. The most likely reason is that the ROC
metric only compares the rank order of saliency values, while the NSS metric looks at the
relative strength of saliency values. This means that while our model has a higher
false positive rate, it assigns a higher value to the correct fixation locations than the
other models do. So for the saliency maps generated by our model the truly salient
locations, as indicated by human fixation, tend to pop out more.
The numbers in table 5.1, however, do not mean much unless the differences are
statistically significant. To determine whether the results of our model were ever
statistically different from the results of the other models we used the paired t-test;
the results of which are shown in table 5.2. From table 5.2, most of the results in
table 5.1 are in fact statistically significant when comparing our model's performance
against the performance of the other two models. The table also shows that though
the Gaussian kernel does have a high value without knowing anything about the
Model                        | ROC (w/o CBC) | NSS (w/o CBC) | ROC (w/ CBC) | NSS (w/ CBC)
Gaussian Kernel              | 0.8739        | 1.2461        | 0.6275       | 0.2143
Itti et al. 98               | 0.9228        | 1.4983        | 0.7565       | 0.7124
Harel et al. 07              | 0.9352        | 1.4739        | 0.7591       | 0.7572
Harrison & Etienne-Cummings  | 0.9086        | 1.6267        | 0.7804       | 0.9262
Table 5.1: Quantitative comparison of 3 models and a Gaussian kernel on the MIT
dataset using different similarity measures. For the similarity measures where CBC
(center bias compensation) was not used, a center bias was added to the final saliency
map, or in the case of the GBVS model the inherent center bias in the model was
treated as sufficient. When CBC was used, the raw saliency map was used without
any center bias added. For the GBVS model, a different formulation of the graph
network was used to remove the inherent center bias. For almost all measures, with
or without CBC, our model outperforms both the Itti and GBVS models. Only in the
ROC measure without CBC does our model fail to outperform the Itti or GBVS model,
though there the Gaussian kernel also has a very high value. All values in
the table have been normalized by the average inter-subject predictability values for
each measure.
input image, the metrics without CBC still have value. The difference between a
metric with CBC and the same metric without it says how the model handles the
boundary of the input image. It is unclear, though, whether a model should suppress
the appearance of salient locations the further they are from the center, to better
match fixation data, or whether the labeling of salient locations near the edge of the image may
be a useful feature.
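The significance testing used above can be sketched as a paired t-test over per-image metric scores; the scores below are made-up numbers for illustration, and in practice scipy.stats.ttest_rel returns the same statistic along with its p-value.

```python
import numpy as np

def paired_t(a, b):
    """Paired t statistic on per-image metric differences d = a - b:
    t = mean(d) / (std(d) / sqrt(n)), with n - 1 degrees of freedom."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(d.mean() / (d.std(ddof=1) / np.sqrt(d.size)))

# Made-up per-image NSS scores for two models over the same five images.
ours = [1.62, 1.58, 1.71, 1.55, 1.66]
other = [1.50, 1.49, 1.60, 1.44, 1.52]
t = paired_t(ours, other)
# |t| above 2.776 (the two-tailed 5% critical value for 4 degrees of
# freedom) indicates a significant difference between the paired scores.
```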
Harrison & Etienne-Cummings vs. | ROC (w/o CBC) | NSS (w/o CBC) | ROC (w/ CBC) | NSS (w/ CBC)
Gaussian Kernel                 | p=9.148e-5    | p=1.489e-6    | p=2.970e-9   | p=3.535e-10
Itti et al. 98                  | p=8.331e-6    | p=0.011       | p=0.035      | p=2.449e-4
Harel et al. 07                 | p=0.0011      | p=0.055       | p=0.101      | p=0.009
Table 5.2: The results of applying a significance test (p) to the results of the metrics
used in table 5.1. Colored numbers indicate that a significant difference was found
between the data for the models under comparison: green when the results of our model
were significantly better and red when the test indicated that our model performed
significantly worse; numbers in black indicate no significant differences were found in
the data.
5.6 Discussion
for center bias may affect the apparent performance of a saliency model. We also
see that by using different similarity measures we can look at different aspects of a
model's performance. For instance, since the differences between the performances of
the different models are significantly different with NSS, while not so with ROC, this
may indicate that the saliency maps produced by our model have higher values at the
salient locations than the saliency values other models give to those same locations.
So, while the three models have about the same accuracy in predicting fixation
locations, it is easier to discriminate salient locations in our model versus the competing
models.
Chapter 6
3-D Imaging Chip
6.1 Introduction
In this thesis, the focus has been on processing images, HDR or otherwise, after
they have been created. In this chapter, we discuss work we have done to design and
fabricate a spike based imager fabricated in a 3-D three tier process. Newer fabrication
processes have been using lower and lower supply voltages to power chips. For
standard imager designs that measure illumination levels by integrating photocurrent
over a fixed capacitance for a given period of time, this reduction in supply voltage
reduces an already small dynamic range [47]. In the imager we designed, we measure
illumination by integrating photocurrent over a capacitance until a set voltage
is reached, rather than for a set time. Thus, a measured illumination level becomes a
function of time, rather than voltage. This is becoming an increasingly common way
to design an imager chip and operates in a way analogous to how photoreceptors in
the eye work.
Without the need for a constant integration time, these imagers are often implemented
in an asynchronous manner, with transmission of data initiated by the pixel
rather than scanning circuitry, as specified by the address event representation [115].
In this case, the illumination information is encoded in the spike rate rather than
in a digital word. Dynamic ranges > 100 dB have been reported for these imagers,
but a drawback of spike rate encoding is that it consumes more power, and brighter
pixels dominate the bandwidth of the transmission line. Also, since dark pixels send
fewer events, any delay in the output of those events has a significant effect on the
accuracy of the estimated illumination level. Delays in the transmission of events
for dark pixels are often due to pixels with higher event rates also trying to transmit
events within the same window of time. Pixels with higher event rates, due to brighter
illumination levels, suffer less from transmission delays affecting the accuracy of their
measurements, due to the larger number of events transmitted; thus any delay is
averaged out [116].
It is commonly assumed that in biological systems, like the photoreceptors in
the eyes, information is encoded in the spike rate, but a growing idea is that some
biological systems utilize other encoding schemes, such as spike ranking and time-to-first
spike (TTFS), to transmit information [117,118]. From this viewpoint, time-to-first
spike imagers have been developed that count the time between when the pixel
was reset and the time it takes to reach some reference voltage [119,120]. These
circuits can also achieve high dynamic ranges, and in order to maintain reasonable
integration times (< 33 ms) the reference voltage can be varied, to ensure that every
pixel spikes at least once per frame. In these circuits, the idea of a frame is restored,
since all pixels are reset at the same time and each pixel only outputs one event per
frame, potentially reducing the power consumption of the imager, since fewer events
must be transmitted by the pixel. However, the power consumption is then shifted
to the timing component, since in order to measure the time between reset and an
event, a high speed clock is required to properly measure the TTFS of the
brightest pixels with reasonable accuracy. TTFS imagers are also more sensitive to
any delays between when an event is generated and when the event generation is
measured.
To reduce the overall power consumed by the imager and timing blocks, we have
proposed a spike based imager that uses both spike rate and time-to-first spike encoding.
This allows us to achieve a high dynamic range while requiring only moderate
clock speeds for TTFS and tries to minimize the pixel spike rates for spike rate
encoding. We plan to validate this method using a generic spike based imager that can
operate in either TTFS or spike rate mode with a variable threshold. This chip has
been fabricated in a 0.13 μm 3D 3-tier SOI CMOS process and we present simulated
results from the chip.
In the next sections, we describe the overview of the imager chip and layout. We
also describe the mixed mode encoding scheme, the mathematical theory of how it
should improve the accuracy of illumination estimates, and show some simulation
results. In section 6.1.1 we briefly describe the chip, in overview, and show the pixel
design, layout, and the fabricated chip. In section 6.1.2 we describe our proposed
mixed mode encoding scheme and the theory of how it should allow high accuracy high
dynamic range estimates of illumination, while controlling the maximum error and
timing clock speed, and we show how this works using data from SPICE simulations
of our circuit.
6.1.1
The generic spike based imager is an 88 × 88 pixel spike based imager chip that
is capable of operating in TTFS mode and spike rate mode. Each pixel is capable
of doing a self-reset after an event has been acknowledged by the arbiter, but the
imager is also capable of doing a global reset to reset all the pixels in the array at
once, preventing further pixel level resets until another global reset has occurred.
The pixel consists of an event generator with a variable threshold reference voltage,
a reset control block to choose between TTFS or spike rate mode, and arbitration
Figure 6.1: Overall Design of the pixel with a diagram of the event generation shown
in detail and a block view of the asynchronous communication component.
6.1.1.1 The Pixel
the pixels also share a Global reset. A pixel can only be reset if the Global reset is
low. If the Global reset is left low the pixels operate in spike rate mode, resetting as
soon as
is high. If Global reset is left high, then after the pixel spikes it enters into
a stand-by mode until the Global reset is set low again. This mode is needed for
time-to-first-spike operation.
The output of the event generator is connected to the request line of the communication
block. So when an event is generated by the pixel, a request to transmit an event
occurs. Once Request has gone high, the wired-or Col_Req line is pulled low. Col_Sel
goes high when the arbiter is ready to take requests from that column. All the
pixels on that column will then transmit their event by pulling the wired-or Row_Req
line low. Once the column has been acknowledged, the pixel can be reset to begin
integrating photocurrent once again, and if Global_Reset is low the pixel will be in
spike rate mode. Otherwise, if Global_Reset is high the pixel will be in time-to-first-spike
mode and will only reset once Global_Reset has gone low. The per pixel Asynchronous
Communication Block interfaces with row and column arbitration trees and encoders
on the periphery of the array on the AER Interface tier, figure 6.2. The AER readout
is a delay-insensitive word-serial AER as described in [122].
[Figure 6.2 tier labels: AER Interface; Event Generator and reset control; Photodiode array]
Figure 6.2: Image of the layout of the HDR image sensor showing the 3 tiers of the 3D chip.
6.1.1.2 Design Layout
Each pixel takes an area of 18 μm × 18 μm over three tiers, with one tier consisting only
of the photodetectors. The entire design occupies an area of 2 mm × 2 mm and the
partitioning of the design over the 3 tiers is shown in figure 6.2. A detail of the pixel
layout over the three tiers is shown in figure 6.3. An image of the fabricated chip is
shown in figure 6.4 and a summary of the chip's properties is given in table 6.1.
[Table 6.1 rows: Process (0.13 μm 3D 3-tier SOI CMOS); Supply voltage; Chip Size (2 mm × 2 mm); Pixel Size (18 μm × 18 μm); Fill Factor; Transistor Count (71 per pixel)]
Figure 6.3: Layout of the pixel across the three tiers. (a) The photodiode. (b) The
event generator and request and acknowledgement circuits. (c) AER transmit request
and TX/RX control.
6.1.2

The mixed mode operation of the imager was inspired by a mixed mode readout scheme of a quadrature encoder, which did both spike rate and time between spikes
since they have not spiked at all yet. The threshold voltage is then ramped up to VDD to ensure that all the pixels in the circuit spike before the end of the frame. This gives us up to two measurements per frame for each pixel, which we can use to obtain the most accurate estimate of each pixel's value. A diagram showing an overview of the operating principle of the mixed mode encoding scheme is shown in Figure 6.5.
6.1.3
The length of time spent in spike rate mode should be set to ensure that all the pixels with an integration time below a certain value have the lowest error in spike rate mode. After that, the readout should be done in TTFS mode. To understand how to determine this time, we describe a simple model of spike based imaging. During spike rate mode, each pixel integrates photocurrent until its voltage reaches V_ref and generates an event. After an event has been generated, the arbiter sends an acknowledgement signal to show that the event has been recognized, and also resets the photodiode to begin the integration process again. After some time T_s, N events from that pixel will have been transmitted by the arbiter. The spike rate of events from that pixel is
f_s = N / T_s = 1 / (t_int + t_wait)
where t_int is the integration time, t_wait is the time the pixel must wait for the arbiter to acknowledge an event, and T_event is the time between reset and when the timer detects an event for that pixel. The variable t_wait is highly dependent on the total spike rate of the imager overall, but for this analysis we will assume that the arbiter is operating under a light load, so t_wait reduces to the read/write time of the arbiter [123].
We wish to use the spike rate from a pixel in order to estimate the average integration time:

t_int,est = T_s / N - t_wait    (6.1.1)

Because N is an integer count of events, the true integration time can lie anywhere between T_s/(N+1) - t_wait and T_s/N - t_wait, so the maximum relative error of the estimate is

err_SR = [ (T_s/N - t_wait) - (T_s/(N+1) - t_wait) ] / [ T_s/(N+1) - t_wait ]    (6.1.2)
During TTFS mode, after a pixel has generated its first event, the pixel does not reset. The pixel only resets after a time T_f has passed since the previous reset. For TTFS, the time taken for an integration to occur is measured as the number of clock ticks, t_clk, that have occurred between reset and when the event was acknowledged:
T_event = floor( (t_int + t_wait) / t_clk ) * t_clk

The absolute value of the relative error in the spike time is given by:

err_TTFS = | t_int - ( ( floor( (t_int + t_wait) / t_clk ) + 1 ) * t_clk - t_wait ) | / t_int    (6.1.3)
In Figure 6.6 we have plotted the maximum error vs. t_int for TTFS and spike rate modes, using reasonable values for a high dynamic range spike based imager. As the integration time becomes very short or very long, one of the two encoding methods suffers. Using a mixed mode readout method, where information from the best source is used (the best source being defined as the one with the lowest error), the maximum error over the whole dynamic range can be reduced to 0.06%.
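The error model behind this comparison can be sketched in a few lines (our own reconstruction from equations 6.1.2 and 6.1.3, with hypothetical function names; the exact worst-case figure depends on chip details this toy model does not capture):

```python
import math

def err_spike_rate(t_int, T_s, t_wait):
    """Worst-case relative error of the spike-rate estimate (cf. eq. 6.1.2)."""
    N = math.floor(T_s / (t_int + t_wait))
    if N < 1:
        return float("inf")  # fewer than one spike in the window
    lo = T_s / (N + 1) - t_wait   # smallest t_int consistent with N spikes
    hi = T_s / N - t_wait         # largest t_int consistent with N spikes
    return (hi - lo) / lo

def err_ttfs(t_int, t_clk, t_wait):
    """Worst-case relative error of the TTFS estimate (cf. eq. 6.1.3)."""
    n = math.floor((t_int + t_wait) / t_clk)
    est = (n + 1) * t_clk - t_wait
    return abs(t_int - est) / t_int

def err_mixed(t_int, T_s, t_clk, t_wait):
    """Mixed mode: take whichever encoding has the lower error."""
    return min(err_spike_rate(t_int, T_s, t_wait),
               err_ttfs(t_int, t_clk, t_wait))

# Values from Figure 6.6: t_clk = 12 ns, T_s = 1 s, t_wait = 70 ns.
t_clk, T_s, t_wait = 12e-9, 1.0, 70e-9
# Sweep t_int from 100 ns to 10 ms in tenth-of-a-decade steps.
worst = max(err_mixed(10 ** e, T_s, t_clk, t_wait)
            for e in [x / 10 for x in range(-70, -19)])
```

Spike rate mode dominates at short integration times and TTFS at long ones, so the `min` of the two error curves stays small across the whole sweep.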
Figure 6.6: Relative error of TTFS and spike rate encoding vs. integration time, using t_clk = 12 ns, T_s = 1 s, t_wait = 70 ns.
This level of accuracy is rarely required for most applications, and it would be more beneficial to increase the error in the measurement in order to lower the overall power consumption, by reducing the clock speed t_clk and reducing the average spike rate. From 6.1.2 and 6.1.3 we have two equations describing the maximum error in spike rate and TTFS modes, respectively. However, to set the maximum error we need to find the value of T_event where equation 6.1.2 is equal to 6.1.3, giving two equations and three unknowns. We can alleviate this by finding an equation describing the power consumption as a function of spike rate and the clock speed of the timer. However, such an equation is dependent on the imager design, the fabrication process, and the system counting the delay between reset and the event.
Figure 6.7: Theoretical relative error using mixed mode encoding.
Unfortunately, without an explicit equation for the relation between spike rate, t_clk, and power consumption, we have only two equations and three unknowns. As an example, we can assume that spike rate and the clock speed of the timer are equally important and simply set the crossover point to 50 µs with a maximum error of 2^-9, to ensure at least 8 bits of accuracy. From 6.1.2 and 6.1.3 we find that T_s = 25.6 ms and t_clk = 98 ns.
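These numbers can be checked with the light-load approximations err_SR ≈ (t_int + t_wait)/T_s and err_TTFS ≈ t_clk/t_int (approximations we derive from equations 6.1.2 and 6.1.3; a back-of-the-envelope sketch, not the actual design procedure):

```python
err_max = 2 ** -9     # at least 8 bits of accuracy
t_cross = 50e-6       # desired crossover integration time (s)
t_wait = 70e-9        # arbiter acknowledge time (s)

# TTFS error ~ t_clk / t_int: slowest clock still meeting err_max at crossover.
t_clk = err_max * t_cross            # about 97.7 ns, i.e. ~98 ns
# Spike-rate error ~ (t_int + t_wait) / T_s: shortest window meeting err_max.
T_s = (t_cross + t_wait) / err_max   # about 25.6 ms
```

Both results match the values quoted above, which suggests the light-load approximations capture the dominant terms at the crossover.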
Figure 6.8: Controlled relative error using reduced t_clk and T_s, from the output of a SPICE simulation.
6.2 Future Work
With the fabrication of the chip, we need to test that the chip works and characterize how well it works, as the 3D process is a fairly new technology and there may still be issues with the fabrication process. With actual data we may be able to develop a better model of the measurement errors, which may allow us to choose an optimal cross-over point between TTFS and spike rate readout.

Chapter 7

Conclusions

7.1 Introduction
In this thesis we have taken a unique view on how to compress high dynamic range images onto a low dynamic range display. We have been concerned with dynamic range compression in order to maximize the amount of perceivable visual information in an image. To solve this problem, it is first necessary to be able to validate the performance of the system that is developed. As a first step in determining whether a compression method maximizes the amount of visual information shown, we have used psychophysical tests and an information content quality assessment (QA) model to compare and measure, respectively, the amount of visual information in a compressed image. The next step would then be to use these evaluations to find a
compression method that maximizes the amount of visual information shown, which is discussed in the section on future work.

In this thesis we validated existing compression methods, using psychophysical studies, to determine which compression methods resulted in the highest amount of perceivable visual information in an image after compression. We also made a modification to one of the best performing compression methods, the bilateral filtering algorithm [23], to improve the amount of perceivable visual information more consistently, by adaptively changing a parameter in the method based on properties of the image. We validated the improvement through the second psychophysical experiment, but human testing is slow, expensive, and yielded only qualitative results at best.
We then modeled the human validation process by developing our quality assessment model to measure the amount of visual information in compressed images. We quantified visual information by equating it with entropy and modeled each image statistically using the statistics of natural scenes. We then compared the measurements of the model against the results of the psychophysical experiments, to try to show that the two were consistent. From our preliminary results we have found that our quality assessment model almost always identifies the automated bilateral filtering algorithm as the tone mapping method that generates images with the highest information content, which agrees with the psychophysical results. However, the unweighted output of the quality assessment model does not rank the other tone mapping algorithms in the same way as they were ranked in the psychophysical study, in terms of how much information seemed visible after HDR images were compressed using the different methods.
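As a minimal illustration of equating visual information with entropy (a toy example of our own, not the ICQA model itself): the Shannon entropy of an image's gray-level histogram drops when a harsh tone mapping crushes distinct levels together.

```python
import math
from collections import Counter

def histogram_entropy(pixels):
    """Shannon entropy (bits) of the gray-level histogram of a pixel list."""
    counts = Counter(pixels)
    n = len(pixels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# A toy 4-level image vs. the same image after crushing it to 2 levels.
original = [0, 1, 2, 3] * 4           # uniform over 4 levels -> 2 bits
crushed = [p // 2 for p in original]  # only 2 levels remain  -> 1 bit
```

The histogram entropy here is only a crude stand-in for the natural-scene-statistics modeling used in the thesis, but it captures the direction of the effect: fewer distinguishable levels, less information.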
As a secondary test of the foundations of the model, and to provide further evidence that a model that uses only statistical properties of images, rather than models of vision, is a valid way to model perception, we extended our QA model to create an ideal observer model in order to predict fixation. We tested the predicted fixation locations made by our ideal observer model against human fixation data from the MIT database [28]. We also compared our model's fixation predictions against the predictions of two other bottom-up, feed-forward fixation models, the Itti model and the GBVS model [29, 30]. Using two different fixation comparison metrics, we showed that our model performed as well as, and in some cases better than, other comparable fixation models.
We have also presented the design of a spike based imager chip fabricated in a 3-tier 3D 0.15 µm CMOS process, along with a mixed-mode readout scheme that uses both TTFS and spike rate encoding to improve the accuracy of measurements by the chip over the entire dynamic range. We also presented theoretical and SPICE simulations to justify the proposed improvements of our readout method.

In the next section we present the preliminary results of our work in identifying potentially salient objects in out-of-focus locations. In section 7.3, we propose the next step needed to reach the final goal of maximizing the amount of visible information of tone mapped images: using the information content quality assessment model along with the tested tone mapping methods to reach an optimal compression setting. We also discuss the future work of all of the research projects presented in this thesis.
7.2
The average person with normal vision perceives their surrounding environment in high resolution and with all objects in sharp focus, but at any given moment that is not what we actually see. Only a small portion of our visual field is seen at our highest spatial resolution, and only objects at that depth are viewed in sharp focus. When electronic sensor/display systems are used, the display does have a constant spatial resolution, but the depth-of-field may not be infinite, so only parts of the scene may actually be in focus. We must then choose what parts of the scene to keep in focus. This can be done entirely manually, often with a predefined set of selectable depth settings, or with an auto-focus function. Usually the auto-focus is set to keep the center of the display in focus. This is a standard approach in photography and in first-person shooter games, as the location of interest is often in the center of the display [124]. But what about the next object or location of interest? It may not be in the center of the image, and if it is at a different depth than the current depth setting, it will appear out of focus and may not appear as salient as it otherwise might.
Focus plays a big part in the salience of an object or location. A person's gaze is more likely to be drawn to a location that is in focus than one that is out of focus. However, just because an object or location appears to have a lower salience because it is out of focus does not mean it is not actually salient. When the depth-of-field can change, the best way to evaluate the visual salience of all objects and locations in a scene at the same time is to view the scene with an infinite depth-of-field, much like how we actually perceive the world. But when only a finite depth-of-field image is available, we want to predict where a person should look or focus next. There is very little work on predicting salience in finite depth-of-field environments [35]. All other work involving finite depth-of-field images or environments either uses gaze to create a finite depth-of-field [125, 126] or creates a finite depth-of-field image to direct gaze [127-129].

Using our ideal observer saliency model, we wish to study where people should look next for a given finite depth-of-field. The main goal is to be able to produce a saliency map from a finite depth-of-field image that closely matches the saliency map of that same scene taken at infinite depth-of-field. To do this we need to be able to suppress or ignore the locations that are in focus and then identify salient locations that are out of focus. Hopefully these potentially salient out-of-focus locations would also be salient if the same image were taken at infinite depth-of-field.
Figure 7.1: Example image used to measure saliency in out-of-focus locations. An artificial depth-of-field was added to put the foreground (b) or the background (c) in focus.
or location was identified as salient in the version of the image where everything was in focus. However, in these cases the images were in color, the lens blur due to loss of focus was very strong (creating a strong separation between in-focus and out-of-focus regions), and the images were very clean, with very little noise. All of these factors helped the saliency model, but may not hold in monochrome NVG images taken at finite depths.
The results can be compared with the fixation data for the original image with an infinite depth-of-field and the saliency map of that image.

Studying where people want to or should look in finite depth-of-field situations has applications in understanding the role depth-of-field plays in fixation, and in developing a smarter auto-focus that implements simple gaze prediction where gaze tracking is impractical or not possible. Knowing where people want to look based on depth may also help in choosing a set of predefined depth settings for quick focus changes. The next step of this work will be to apply this model to real finite depth-of-field images and try to predict salience in these images. We may also add to the model a single-pass depth-from-blur stage in order to partition the input image into depth regions. Each region can then have a different weight associated with the raw salience at each location, in order to cancel out the effect of defocus on saliency prediction. This may result in a single saliency map from a finite depth-of-field image of some scene that matches the saliency map from an infinite depth-of-field version of that same scene.
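The proposed per-region weighting could be sketched as follows (a hypothetical illustration; `reweight_saliency`, the region labels, and the weights are placeholders of our own, not part of the implemented model):

```python
def reweight_saliency(raw_saliency, depth_region, region_weights):
    """Scale each pixel's raw salience by a per-depth-region weight.

    raw_saliency:   2D list of raw salience values
    depth_region:   2D list of region labels from a depth-from-blur pass
    region_weights: dict mapping region label -> weight (>1 boosts
                    out-of-focus regions, <1 suppresses in-focus ones)
    """
    return [[s * region_weights[r] for s, r in zip(srow, rrow)]
            for srow, rrow in zip(raw_saliency, depth_region)]

# Toy 2x2 example: region 0 is in focus (suppressed), region 1 is out of focus.
raw = [[0.9, 0.8], [0.2, 0.3]]
regions = [[0, 0], [1, 1]]
weighted = reweight_saliency(raw, regions, {0: 0.5, 1: 2.0})
```

In the toy example, the out-of-focus row ends up comparable in salience to the in-focus row, which is the intended effect of cancelling defocus.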
Figure 7.2: Identification of secondary salient locations and objects in images with an artificially created finite depth-of-field. Columns show the analyzed image (a), its saliency map (b), and secondary salient locations (c). By suppressing the topmost salient locations, column (b), it is possible to identify potentially salient locations in out-of-focus regions, column (c).
7.3 Future Work
The focus of this thesis has been on evaluating the amount of visual information in compressed images, through psychophysical studies and the development of a model to measure this value directly. We have shown modest improvement in the amount of perceived visual information in tone mapped images through an alteration of the bilateral filter algorithm. However, we have not yet tackled the issue of maximizing the amount of visual information in tone mapped images. We have taken an important first step in this process by defining what we mean by information and developing a model to measure it. The next step involves using the results of the QA model for each tone mapping method in a consistent manner.
Tone mapping methods use very different algorithms to perform their dynamic range compression, so without some way to look at each method on the same level, even knowing that one method shows more visual information than another, it would be unclear how to use that knowledge to find an optimal compression setting. However, Mantiuk and Seidel [3] have developed a generic tone mapping operator (GTMO) that can emulate the performance of many tone mapping operators. Using this GTMO, it is possible to mimic the tone mapping methods tested in this study, as well as a host of other tone mapping operators, simply by changing parameter settings. Thus, with knowledge of the parameter settings that mimic the performance of the tested tone mapping algorithms, it is possible to compare most tone mapping algorithms on an equal footing. Along with the ICQA model, it would then be possible to discover the compression settings that maximize the amount of visual information in a tone mapped image.
Before a maximal compression setting can be reached, there are many open questions that may first need to be answered. The GTMO model has several parameters, and it is unclear how to translate the model output and the changing parameters into an optimization equation. It is also unclear whether there will be a single optimal setting for a single image or a single optimal setting for a set of images. Finally, it is unclear whether, if the ICQA model were made consistent with the psychophysical experiment results for the different tone mapping methods from chapter 3.4, it would remain so for all potential compression settings.
7.4 Summary
We have described and designed the components needed to find and produce the optimal compression of high dynamic range images, in order to maximize the amount of perceivable information displayed to a viewer. To do this, we described the parameters of visual perception and their limits. We then developed an information content quality assessment model to measure the amount of visual information in tone mapped images. To validate the approach we took in designing the model, namely using only concepts from information theory and ignoring any explicit biological elements, we modified our quality assessment model so that it could predict visual fixation. We tested the predictions of our model against previously collected human data and against the predictions of other bottom-up, feed-forward models of fixation. To test the ICQA model directly, we ran the model over images from the second psychophysical study and compared the model results against the psychophysical data. The psychophysical experiments we ran compared different dynamic range compression methods in terms of the amount of information that is visible after compression [1]. The final component needed in order to maximize the amount of perceivable visual information displayed to a user after tone mapping is to combine the ICQA model output with the tone mapping data in the generic tone mapping model. We have also presented preliminary work in the detection of salient locations in finite depth-of-field images. Finally, we have presented the design of our spike based imager and the mixed-mode readout scheme that improves the accuracy of the measurement from each pixel over the entire dynamic range.
In this chapter we presented an approach that allows the results of our ICQA model and the information from the different tone mapping methods to be combined, to determine and reach an optimal compression setting, so that the amount of perceivable visual information shown on a display after tone mapping is maximized. We also presented our work in modeling fixation while compensating for loss of focus due to a finite depth-of-field.
Bibliography
[1] A. Harrison, L. Mullins, and R. Etienne-Cummings, "Sensor and display human factors based design constraints for head mounted and tele-operation systems," Sensors, vol. 11, no. 2, pp. 1589-1606, 2011.
[2] IEEE, 2012.

[3] R. Mantiuk and H.-P. Seidel, "Modeling a generic tone-mapping operator," Computer Graphics Forum, 2008. [Online]. Available: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-8659.2008.01168.x
[4] A. Harrison, R. Ozgun, J. Lin, A. Andreou, and R. Etienne-Cummings, "A spike based 3D imager chip using a mixed mode encoding readout," in Biomedical Circuits and Systems Conference (BioCAS). IEEE, 2012, pp. 190-193.
[5] E. Reinhard, G. Ward, S. Pattanaik, and P. Debevec, High Dynamic Range Imaging. Morgan Kaufmann. [Online]. Available: http://anyhere.com/gward/papers/REINHARD-Flyer.pdf
[6] V. G. CuQlock-Knopp, W. Torgerson, D. E. Sipes, E. Bender, and J. Merritt, "A comparison of monocular, biocular, and binocular night vision goggles for traversing off-road terrain on foot," Army Research Laboratory, Tech. Rep., 1995.
[7] "field of view: perimeter chart," in Art. Encyclopædia Britannica Online Academic Edition. Encyclopædia Britannica Inc., 2012. Web.
pp. 393-398. [Online]. Available: http://www.opticsinfobase.org/abstract.cfm?URI=josaa-2-3-393
[12] J. Ferwerda, S. Pattanaik, P. Shirley, and D. Greenberg, "A model of visual adaptation for realistic image synthesis," in Proceedings of the 23rd annual conference on Computer graphics and interactive techniques.
[13] P. Lichtsteiner, C. Posch, and T. Delbruck, "A 128 × 128 120 dB vision sensor that responds to relative intensity change," in IEEE International Solid-State Circuits Conference. IEEE, 2006, pp. 2060-2069. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1696265
[14] J. Dubois, D. Ginhac, M. Paindavoine, and B. Heyrman, "A 10,000 fps CMOS sensor with massively parallel image processing," IEEE Journal of Solid-State Circuits, vol. 43, no. 3, pp. 706-717, Mar. 2008. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4456774
[15] T. Etoh, D. Poggemann, G. Kreider, H. Mutoh, A. Theuwissen, and Y. Takano, "... 100 ... iris-less CMOS ...," IEEE Journal of Solid-State Circuits. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1696161
[19] B. Jones, Retina display iPad, online, April.

no. 3, pp. 419-426, Sep. 2003. [Online]. Available: http://www.blackwell-synergy.com/links/doi/10.1111/1467-8659.00689
[21] G. Larson, H. Rushmeier, and C. Piatko, "A visibility matching tone reproduction operator for high dynamic range scenes." [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=646233
[22] J. DiCarlo and B. Wandell, "Rendering high dynamic range images," in Proceedings of the SPIE: Sensors and Camera Systems for Scientific, Industrial, and Digital Photography Applications, vol. 3965.

[23] F. Durand and J. Dorsey, "Fast bilateral filtering for the display of high-dynamic-range images," in Proceedings of the 29th annual conference on Computer graphics and interactive techniques - SIGGRAPH '02.
[26] A. Yoshida, "Perceptual evaluation of tone mapping operators with real-world scenes," Proceedings of SPIE, pp. 192-203, 2005. [Online]. Available: http://link.aip.org/link/?PSI/5666/192/1&Agg=doi
[27] R. J. Peters, A. Iyer, L. Itti, and C. Koch, "Components of bottom-up gaze allocation in natural images," Vision Research, vol. 45, no. 18, pp. 2397-2416, Aug. 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0042698905001975
[28] T. Judd, K. Ehinger, F. Durand, and A. Torralba, "Learning to predict where humans look," in Computer Vision, 2009 IEEE 12th International Conference on. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5459462
[29] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=730558

[30] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," Advances in Neural Information Processing Systems, vol. 19, pp. 545-552, 2007.
[35] L. Cong, R.-F. Tong, and D.-Y. Bao, "Detect saliency to understand a photo,"
midget and parasol ganglion cells of the human retina," Proceedings of the National Academy of Sciences, vol. 89, no. 20, 1992.
and R. F. Hess, "Human peripheral spatial resolution for achromatic and chromatic stimuli: limits imposed by optical and retinal factors," The Journal of Physiology, vol. 442, pp. 47-64, 1991.
and S. M. Macrae, How Far Can We Extend the Limits of Human Vision? [Online]. Available: http://books.google.com/books?id=QEjWhMo-yosC&printsec=frontcover&sig=sqLtWtwTFFcWxt-hS0ffvnceZlE
[41] Y. Yeshurun and L. Levy, "Transient spatial attention degrades temporal resolution," Psychological Science, vol. 14, no. 3, pp. 225-231, 2003. [Online]. Available: http://pss.sagepub.com/content/14/3/225.short
[42] B. Rogowitz, "The human visual system: A guide for the display technologist," Proceedings of the Society for Information Display, vol. 24, no. 3, pp. 235-252,

stereopsis in human depth perception," Vision Research, vol. 22, pp. 261-270, 1982. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0042698982901262
[47] E. R. Fossum, "CMOS image sensors: electronic camera-on-a-chip," IEEE Transactions on Electron Devices, vol. 44, no. 10, pp. 1689-1698, 1997.

[48] O. Yadid-Pecht and A. Belenky, "In-pixel autoexposure CMOS APS," IEEE Journal of Solid-State Circuits, vol. 38, no. 8, pp. 1425-1428, 2003.

[49] T. Delbruck and C. A. Mead, "Analog VLSI adaptive logarithmic wide-dynamic-range photoreceptor," IEEE ISCAS, pp. 339-342, 1994.

[50] S. Decker, D. McGrath, K. Brehmer, and C. G. Sodini, "A 256 × 256 CMOS imaging array with wide dynamic range pixels and column-parallel digital output," IEEE Journal of Solid-State Circuits, vol. 33, no. 12, pp. 2081-2091, 1998.
[51] P. E. Debevec and J. Malik, "Recovering high dynamic range radiance maps from photographs,"
P. Shirley, and J. Ferwerda, "Photographic tone reproduction for digital images," no. 3, pp. 267-276, Jul. 2002. [Online]. Available: http://portal.acm.org/citation.cfm?doid=566654.566575
[54] T. O. Aydin, R. Mantiuk, K. Myszkowski, and H.-P. Seidel, "Dynamic range independent image quality assessment," ACM Transactions on Graphics. [Online]. Available: http://portal.acm.org/citation.cfm?doid=1360612.1360668
[55] S. Daly, "The visible differences predictor: an algorithm for the assessment of image fidelity," Proceedings SPIE: The International Society for Optical Engineering, no. 1666, 1992. [Online]. Available: http://www.ece.mtu.edu/faculty/ztian/ee5950/Ong_article.pdf
[56] M. Fairchild and G. Johnson,

[57] M. Fairchild and G. Johnson, "Meet iCAM: A next-generation color appearance model," in IS&T/SID 10th Color Imaging Conference: Color Science, Systems, Technologies, Applications. Society for Imaging Science and Technology. [Online]. Available: https://ritdml.rit.edu/handle/1850/7860
[58] N. Hautiere and J.-P. Tarel, "Blind contrast enhancement assessment." [Online]. Available: http://www.ias-iss.org/ojs/IAS/article/view/834
[59] J. Kuang, G. Johnson, and M. Fairchild, "iCAM06: A refined image appearance model for HDR image rendering," Journal of Visual Communication and Image Representation, vol. 18, no. 5, pp. 406-414, October 2007. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1047320307000533
[60] E. Larson and D. Chandler, "Most apparent distortion: full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, 2010. [Online]. Available: http://link.aip.org/link/jeime5/v19/i1/p011006/s1
[61] R. Mantiuk, S. Daly, K. Myszkowski, and H.-P. Seidel, "Predicting visible differences in high dynamic range images - model and its calibration," in Human Vision and Electronic Imaging X, Proc. of SPIE, vol. 5666, pp. 204-214, 2005.
[62] R. Mantiuk, K. Myszkowski, and H.-P. Seidel, "Visible difference predicator for high dynamic range images," 2004 IEEE International Conference on Systems, Man and Cybernetics, pp. 2763-2769. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1400750
[63] S. Pattanaik and M. Fairchild, "Multiscale model of adaptation, spatial vision and color appearance," in IS&T/SID 6th Color Imaging Conference, Scottsdale, 1998, pp. 2-7. [Online]. Available: http://www.inventoland.net/imaging/JEI/002.PDF
[64] S. Pattanaik and J. Ferwerda, "A multiscale model of adaptation and spatial vision for realistic image display," in Proceedings of SIGGRAPH 98, ACM SIGGRAPH.
Graphics Forum, vol. 25, no. 3, pp. 427-438, Sep. 2006. [Online]. Available: http://www.blackwell-synergy.com/doi/abs/10.1111/j.1467-8659.2006.00962.x
[68] Z. Wang and A. C. Bovik, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 1-14, 2004.

[69] Z. Wang and A. Bovik, "Modern image quality assessment," Synthesis Lectures on Image, Video, and Multimedia Processing,
Systems and ..., 2003. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1292216

[71] J. Lubin, "A human vision system model for objective picture quality measurements," in International Broadcasting Convention, 1997, pp. 498-503. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=626903
[72] A. Srivastava, A. Lee, E. Simoncelli, and S.-C. Zhu, "On advances in statistical modeling of natural images," Journal of Mathematical Imaging and Vision, vol. 18, no. 1, pp. 17-33, 2003. [Online]. Available: http://www.springerlink.com/index/JT354188Q4685L29.pdf
[73] M. J. Wainwright, E. P. Simoncelli, and A. S. Willsky, "Random cascades on wavelet trees and their use in analyzing and modeling natural images," Applied and Computational Harmonic Analysis, vol. 11, no. 1, pp. 89-123, 2001.
vol. 265, no. 1394, pp. 359-366, 1998. [Online]. Available: http://rspb.royalsocietypublishing.org/content/265/1394/359.short
[76] E. Simoncelli and B. Olshausen, "Natural image statistics and neural representation." [Online]. Available: http://www.annualreviews.org/doi/pdf/10.1146/

http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1240101
[79] D. Gao, S. Han, and N. Vasconcelos, "Discriminant saliency, the detection of suspicious coincidences, and applications to visual recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 6. [Online]. Available: http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=4770107
[81] D. Gao, V. Mahadevan, and N. Vasconcelos, "On the plausibility of the discriminant center-surround hypothesis for visual saliency." IEEE.
Singapore: World Scientific Publishing Co. Pte. Ltd., 2002, pp. 236-248. [Online]. Available: http://books.google.com/books?hl=en&lr=&id=ovfVfSH_tiMC&oi=fnd&pg=PA236
[85] L. Itti and C. Koch, "A saliency-based search mechanism for overt and covert shifts of visual attention," Vision Research, vol. 40, no. 10-12, pp. 1489-1506, Jun. 2000. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0042698999001637
[86] Y. Lin, B. Fang, and Y. Tang, "A computational model for saliency maps by using local entropy," in AAAI Conference on Artificial Intelligence, 2010. [Online]. Available: viewFile/1831/2127
[87] D. J. Parkhurst, K. Law, and E. Niebur, "Modeling the role of salience in the allocation of overt visual attention," Vision Research, vol. 42, no. 1, pp. 107-123, Jan. 2002.
Journal of Vision, vol. 11, no. 3, Jan. 2011.
vol. 12, pp. 97-136, 1980. [Online]. Available: http://www.sciencedirect.com/science/article/pii/0010028580900055
[90] C. Koch and S. Ullman, "Shifts in selective visual attention: towards the underlying neural circuitry," Human Neurobiology, vol. 4, pp. 219-227, 1985. [Online]. Available: http://papers.klab.caltech.edu/104/1/200.pdf
[91] T. Kadir and M. Brady, "Saliency, scale and image description," International Journal of Computer Vision, vol. 45, no. 2, pp. 83-105, 2001. [Online]. Available: http://www.springerlink.com/index/T45N2G8543574026.pdf
[92] N. Tamayo and V. Traver, "Entropy-based saliency computation in log-polar images," in Vision Theory and Applications, 2008, pp. 501-506. [Online]. Available: http://marmota.dlsi.uji.es/WebBIB/papers/2008/0-C2_132_Traver.pdf
[93] W. Wang, Y. Wang, Q. Huang, and W. Gao, "Measuring visual saliency by site entropy rate," in Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5539927
[94] N. D. B. Bruce and J. K. Tsotsos, "Saliency based on information maximization." [Online]. Available: http://www.cse.yorku.ca/~tsotsos/HomepageofJohnK_files/NIPS2005_0081.pdf
[95] N. D. B. Bruce and J. K. Tsotsos, Saliency, attention, and visual search: an information theoretic approach, Journal of Vision, vol. 9, no. 3, pp. 5.1-24, Jan. 2009. [Online]. Available: http://www.journalofvision.org/content/9/3/5.short
[96] L. Zhang, M. H. Tong, T. K. Marks, H. Shan, and G. W. Cottrell, SUN: A Bayesian framework for saliency using natural statistics, Journal of Vision, vol. 8, no. 7, pp. 32.1-20, Jan. 2008. [Online]. Available: http://www.journalofvision.org/content/8/7/32.short
[97] M. Yeh, J. L. Merlo, C. D. Wickens, and D. L. Brandenburg, Head up versus head down: The costs of imprecision, unreliability, and visual clutter on cue effectiveness for display signaling, Human Factors: The Journal of the Human Factors and Ergonomics Society, vol. 45, no. 3, pp. 390-407, 2003. [Online]. Available: http://hfs.sagepub.com/cgi/doi/10.1518/hfes.45.3.390.27249
[98] J. M. Wolfe, Guided search 2.0: A revised model of visual search, Psychonomic Bulletin & Review, vol. 1, no. 2, pp. 202-238, Jun. 1994. [Online]. Available: http://www.springerlink.com/content/c0234t6313755617/
[99] W. Watkins, V. CuQlock-Knopp, J. Jordan, A. Marinos, M. Phillips, and J. Merritt, Sensor fusion: ..., in Proceedings of SPIE, vol. 4029. [Online]. Available: http://link.aip.org/link/?PSISDG/4029/59/1
[100] R. N. Shepard, Attention and the metric structure of the stimulus space, Journal of Mathematical Psychology, vol. 1, no. 1, pp. 54-87, Jan. 1964.
..., vol. 275, no. 1649, pp. 2299-2308, Oct. 2008. [Online]. Available: http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2495046&tool=pmcentrez&rendertype=abstract
..., attention, Advances in Neural Information Processing Systems, vol. 18, p. 547, 2006.
... Wiley ... C. ... vol. 40, International Journal of Computer Vision ... [Online]. Available: http://www.springerlink.com/index/W5515K817681125H.pdf
[106] B. W. Tatler, The central fixation bias in scene viewing: Selecting an optimal viewing position independently of motor biases and image feature distributions, Journal of Vision, vol. 7, no. 14, 2007.
[107] B. W. Tatler, R. J. Baddeley, and I. D. Gilchrist, Visual correlates of fixation selection: effects of scale and time, Vision Research, vol. 45, no. 5, pp. 643-659, Mar. 2005. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0042698904004626
[108] R. Carmi and L. Itti, The role of memory in guiding attention during natural vision, Journal of Vision, vol. 6, no. 9, pp. 898-914, Jan. 2006. [Online]. Available: http://www.journalofvision.org/content/6/9/4.short
[109] Y. Gousseau and F. Roueff, Modeling occlusion and scaling in natural images, Multiscale Modeling & Simulation, vol. 6, no. 1, p. 105, 2007.
[110] ..., images, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 4. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=917579
[111] S. Winkler, Vision models and quality metrics for image processing applications, Ph.D. dissertation, École Polytechnique Fédérale de Lausanne, 2000.
[112] P. Burt and E. Adelson, The Laplacian pyramid as a compact image code, IEEE Transactions on Communications, vol. 31, no. 4, pp. 532-540, Apr. 1983. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1095851
[113] A. Calderbank, I. Daubechies, W. Sweldens, and B. Yeo, Lossless image compression using integer to integer wavelet transforms, in Image Processing, 1997. Proceedings., vol. 1. IEEE, 1997.
[114] ..., IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/17076403
[115] K. Boahen, Point-to-point connectivity between neuromorphic chips using address events, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 47, no. 5, pp. 416-434, 2000.
[116] ..., high precision position and velocity measurements chip with serial peripheral interface, Integration, the VLSI Journal, vol. 41, no. 2, pp. 297-305, Feb. 2008.
[117] A. Delorme, Face identification using one spike per neuron: resistance to image degradations, Neural Networks, vol. 14, no. 6-7, pp. 795-803, 2001.
[118] R. VanRullen and S. Thorpe, Rate coding versus temporal order coding: what the retinal ganglion cells tell the visual cortex, Neural Computation, vol. 13, no. 6, pp. 1255-1283, 2001. [Online]. Available: http://www.mitpressjournals.org/doi/abs/10.1162/08997660152002852
[119] C. Shoushun and A. Bermak, Arbitrated time-to-first spike CMOS image sensor with on-chip histogram equalization, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 3, pp. 346-357, Mar. 2007.
... [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5010336
[123] K. Boahen, ... Analysis ... [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1310500
[124] A. Kenny, H. Koesling, D. Delaney, S. McLoone, and T. Ward, ..., in 19th European Conference on Modelling and Simulation (ECMS 2005). Riga: IEEE, 2008. [Online]. Available: http://ieeexplore.ieee.org/articleDetails.jsp?arnumber=4480749&contentType=Conference+Publications
... New York, New York, USA: ACM Press, Jul. 2010, p. 1. [Online]. Available: download?doi=10.1.1.28.9079&rep=rep1&type=pdf
Vita
the Engineer in 2011. His research in image analysis focuses on studying and developing mathematical models of perception, from an information theory perspective, to improve the appearance of features and details in images. Andre's work has been published at the IEEE Biomedical Circuits and Systems (BioCAS) conference, the IEEE Conference on Information Sciences and Systems (CISS), and in the MDPI journal Sensors.