
A Correlation Based Stereo Vision System for Face Recognition Applications

April 2004
Daniel Bardsley (djb01u)
djb01u@cs.nott.ac.uk

A Correlation Based Stereo Vision System For Face Recognition Applications

Daniel Bardsley
djb01u@cs.nott.ac.uk

Supervised by: Bai Li
bai@cs.nott.ac.uk

2004
University of Nottingham


Contents

1 Abstract
2 Introduction
3 Goals and Motivation
4 Literature Review
4.1 Face Recognition
4.2 3D Reconstruction
4.3 Surface Estimation
4.4 Summary
5 System Outline
6 Calibration
6.1 Intrinsic and Extrinsic Parameters
6.2 Parameter Estimation
6.3 Calibration Testing
7 Rectification
8 Correlation
8.1 Input Point Detection
8.2 Intensity Based Pixel-wise Correlation
8.3 SSD
8.4 ZMNCC
8.5 Correspondence Testing
8.6 Matching Constraints
8.7 Constraint Testing
8.8 Alternative Correspondence Measures
9 Projective Reconstruction
9.1 Reconstruction Testing
10 Surface Estimation
10.1 Surface Estimation Testing
10.2 Texture Mapping
11 Implementation
11.1 Design Choices
11.2 Application Architecture
11.3 Data Structures and Algorithms


11.4 Implementation Results
12 Software Libraries
13 Results
14 Conclusions and Future Work
15 Bibliography


List of Figures

Figure 1: High level outline of the reconstruction system.
Figure 2: Calibration Dialog Screenshot
Figure 3: Artificial Test Rig Camera Configuration
Figure 4: Graphical representation of epipolar geometry.
Figure 5: A Rectified Input Image Pair
Figure 6: Stereo pair input image (left) and ground truth disparity data (right).
Figure 7: SSD (left) and ZMNCC (right) disparity maps.
Figure 8: Effects of constraint application.
Figure 9: 3D Studio Max cube reconstruction. Test input (purple cube) and reconstructed output (red spheres) shown from left, right, top and perspective views.
Figure 10: Reconstructed Cube output after mesh triangulation and surface construction.
Figure 11: Original face rendering (left) and the reconstructed mesh (middle) along with a full surface reconstruction (right).
Figure 12: Texture mapped model reconstruction
Figure 13: The raw data view displays actual pixel co-ordinates of the point matches, reconstructed points, normalised model co-ordinates and raw calibration data.
Figure 14: Simplified UML diagram of the user interface / MFC portion of the FaceScanner Application. Some fields and methods have been omitted for conciseness.
Figure 15: Simplified UML diagram of VisionLib, the library containing all the computer vision related code within the project.
Figure 16: FaceScanner application screenshot
Figure 17: Fully automatic reconstruction of a synthesized face from stereo images. White dots on the 3d model show initial point match positions.


1 Abstract

Three-dimensional reconstruction using stereo vision is an important research topic in computer science. A computer's ability to perceive the world in which it is situated has applications in many areas of industry; likewise, face recognition is an area of comparable interest. The fusion of the two subject areas should allow the differing techniques to complement each other in a fashion that improves recognition results and adds robustness against variable recognition conditions. We explore the reconstruction process in general and the specifics of implementing a stereo vision system aimed at reconstructing face surfaces in three dimensions, with particular attention paid to the suitability of the output models for face recognition systems.

2 Introduction
Computer Vision is one of the fastest growing areas within Computer Science. Aided by rapid recent progress in hardware and software design, computer vision projects are making use of vast increases in processing and memory capacities to enhance their performance. In order for computers to effectively process, segment and analyse visual input from their environments, it is often a requirement that the system is able to obtain data about the surrounding world in a format that can be easily equated to the actual environment in which the system finds itself. In the case of many vision systems this could be a three-dimensional representation of the real world. For humans this is a task that we achieve quite naturally from an early age, and it soon becomes second nature for us to accurately judge distance, perspective and space. However, when human vision systems are analysed it becomes apparent that the brain uses a multitude of techniques to give us a sense of the three-dimensional world in which we live.

In order for a vision system to obtain depth data from a scene it is possible to use a number of
different techniques. Three dimensional scene data can be obtained from sources including
object shading, motion parallax data, structured light or laser range finders. However,
perhaps the most obvious technique is that of stereo vision. In a system analogous to a pair
of human eyes, the input to two cameras observing the same scene can be analysed and the
differences between the two images used to compute object depth and hence a model of the
scene that the system is viewing. The utilities of a robust implementation of such a system
are many and potentially include applications in areas such as space flight [18], face
recognition [23], immersive video conferencing [56] and industrial inspection [20] to name just
a few.
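The geometry underlying the stereo approach can be sketched in a few lines. For a rectified camera pair, a scene point seen at different horizontal positions in the two images has a disparity that is inversely proportional to its depth. The focal length and baseline values below are purely illustrative, not those of any rig described later:

```python
# Depth from disparity for a rectified stereo pair: a feature at column
# x_left in one image and x_right in the other has disparity
# d = x_left - x_right, and depth Z = f * B / d.

def depth_from_disparity(x_left, x_right, focal_px, baseline_m):
    """Return depth in metres; focal length in pixels, baseline in metres."""
    d = x_left - x_right
    if d <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    return focal_px * baseline_m / d

# A point with 20 pixels of disparity, seen by cameras with an 800 px
# focal length and a 10 cm baseline, lies 4 m from the rig.
z = depth_from_disparity(320, 300, focal_px=800, baseline_m=0.10)
print(z)  # 4.0
```

Nearby objects therefore produce large disparities and distant objects small ones, which is why stereo accuracy degrades with distance.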


3 Goals and Motivation

Traditionally face recognition algorithms have been able to achieve high levels of accuracy
when the subject face is presented in a frontal pose. Balasuriya and Kodikara describe one
such system in [6] which utilises principal component analysis to achieve reasonable levels of accuracy. Indeed, [45] furthers this work to produce a system which is capable of face authentication under difficult image conditions such as “linear and non-linear illumination change, white Gaussian noise and compression artefacts” [45]. Both of these systems and many others suffer from reduced rates of accuracy when a non-frontal face pose is used as input, thus reducing their usefulness for many application areas where they might otherwise have been implemented. Despite work to develop more advanced algorithms that display a higher degree of pose invariance [32, 43, 58], greatest accuracy seems to be achievable only
when the face is presented in a frontal pose. Work to create systems with a higher degree of
pose invariance has led down a number of paths including, for example, the development of
systems that synthesize face images under varying pose conditions and then use these
synthesized images as a basis for recognition [34]. Other systems fully reconstruct a 3D face
surface either from scratch [1] or by deforming a generic head model [23]. Input into the
various face reconstruction systems ranges from structured light, range scanner data or
standard video images. The first two of these options require hardware in addition to image
capture devices to obtain range data and will often require the subject to be positioned in a
controlled environment whilst the data is acquired. As a more versatile solution, applicable to
a wider variety of applications, the use of standard images as input is preferable. This
method, however, suffers from a reduced level of reconstruction accuracy when compared to
hardware range finders due to a greater number of interfering factors (illumination, pose,
background clutter etc.) and inherent difficulties in implementing successful correlation
techniques.

Our work will consider generating a face surface model, initially without the aid of a generic
head model, from a single pair of calibrated stereo images captured from standard CCTV
cameras. Output will be produced in the form of a 3D, texture mapped surface model of the
subject face with the additional aim that the models produced will be suitable for input at a
later date into a recognition system. The motivation behind this choice is the desire to
produce high quality reconstructions without the aid of dedicated range finding hardware and
to potentially use the reconstructions as part of a pose invariant recognition system. In order to implement such a system, algorithms and software will need to handle initial calibration, the correspondence problem, mesh generation and object reconstruction. Since a wide variety of methods and techniques are required simply to obtain a reconstruction, the process of recognition is beyond the scope of this project. Furthermore, since a wide variety of algorithms could be implemented at each stage of the reconstruction, the project will focus some attention on the development of a robust reconstruction framework for testing, analysis
and comparison of each algorithm. A final aim throughout development will be to ensure that
all the general computer vision algorithms which may prove useful in other applications
should be implemented in a reusable library to enable their functionality to be leveraged in
future projects.

As a final general aim it is important that the application interface is of commercial standard. Reconstruction places high demands on data collection, processing and display, and requires relatively complicated user interaction with this data. To this end, data must be collected and displayed in an intelligent and intuitive environment so that the volume of data can be analysed in a useful manner.

In summary the high level aims of the project are:

• Research and develop methods of stereo camera rig calibration using standard
CCTV cameras and appropriate capture cards.
• Development of solutions to the correspondence problem.
• Research into appropriate mesh generation and surface reconstruction algorithms.
• Produce surface models suitable for input into a potentially pose invariant recognition
system.
• Create a working implementation of a face surface reconstruction application with a
commercial strength user interface.

Application-specific implementation goals are discussed in greater detail in the implementation section.


4 Literature Review

Due to the wide variety of possible implementations of various 3D vision systems much
research has been devoted to the field. The effectiveness of different methods and their
appropriateness for specific applications has been widely considered. Moyung provides an
excellent overview of many of the fundamental problems associated with stereo vision [42]; here, however, we will consider the development of a multitude of algorithms and techniques
for both the recovery of scene depth information and the reconstruction of the acquired three
dimensional data, paying particular attention to suitability of the data for face recognition
applications.

4.1 Face Recognition

Analysis of the human visual system has shown that we can quickly recognise a large number of faces, suggesting the human brain may only use a small number of parameters to identify each face [30]. The problem of compressing face data to a few parameters without reducing accuracy is non-trivial. Principal component analysis can be utilised in order to aid the parameterisation of this data and, through feature comparison, early face recognition systems often used this and related techniques to achieve recognition [3, 29]. Principal component analysis involves the selection of image features such that the original data is represented accurately by a reduced data set. For example, if one image feature can be accurately predicted from another then that feature is clearly redundant and need not be included in the dataset. Furthering the idea of removing redundant image features, we can create new features as functions of the old ones. Forsyth and Ponce state “in principal component analysis, we take a set of data points and construct a lower dimensional linear subspace that best explains the variations of these data points from their mean” [19]. The PCA-compressed form of the images was known originally as eigenpictures and, more recently, when applied specifically to face recognition applications, as eigenfaces. Here, faces were indexed, stored and compared in eigenface form for recognition. Yambor, Draper and Beveridge discuss a variety of recent PCA-based approaches to face recognition in [57].
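As a minimal illustration of the subspace idea (not the eigenface pipeline itself), the first principal axis of a small 2D point set can be recovered by power iteration on its covariance matrix; the data points below are illustrative:

```python
import math

# Principal component analysis in miniature: for a 2D point set, find the
# direction of greatest variance (the first principal axis) by power
# iteration on the 2x2 covariance matrix of the mean-centred data.

points = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
          (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

n = len(points)
mx = sum(p[0] for p in points) / n
my = sum(p[1] for p in points) / n

# Sample covariance matrix entries.
cxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
cyy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)

# Power iteration: repeatedly multiplying a vector by the covariance
# matrix and renormalising converges to the dominant eigenvector.
vx, vy = 1.0, 0.0
for _ in range(50):
    wx = cxx * vx + cxy * vy
    wy = cxy * vx + cyy * vy
    norm = math.hypot(wx, wy)
    vx, vy = wx / norm, wy / norm

print((round(vx, 3), round(vy, 3)))  # (0.678, 0.735)
```

Projecting each mean-centred point onto this axis (and onto further, orthogonal axes) gives the reduced representation PCA is built on; eigenfaces apply exactly this construction in the much higher-dimensional space of face images.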

Whilst it is possible to achieve good results using these methods, faces are generally required to be in a frontal pose. Several modifications have been proposed in order to overcome this
limitation. Investigations and research into eigenspaces as a solution to this and other variant
image property problems are discussed in [43] and [58] with the latter paying particular
attention to the robustness of the eigenspace solution under variant pose conditions.
However, whilst these solutions improve pose invariance to a degree they do not provide a
general solution to the frontal pose problem.


Taking a different approach several projects have been undertaken to investigate the
potential of utilising a true three dimensional representation of the face in an effort to give
recognition systems a greater degree of pose invariance. Beumier and Acheroy, [8], present
such a system for automatic face recognition from 3D surface models. The system uses
projected parallel white light stripes to reconstruct depth information by analysing the
deformation of the projected pattern. The authors describe the development of a system
which is “adapted to facial surface acquisition”, [8], to make use of domain specific
knowledge. Examples of this type of domain knowledge include the utilisation of symmetric
face properties and other invariant face features to aid the reconstruction in the feature
correspondence stages. In contrast to this technique [23] utilises basic stereo image pairs as
input without the help of additional reconstruction aids. Furthermore, rather than attempting
to use the 3D data in a direct reconstruction the system deforms a generic head model to the
parameters specified by the acquired data. The advantages of such a system usually include
increased accuracy and speed due to the amount of data initially available to the system in
the form of the generic model. However, the system is specifically designed to only consider
head models and as such does not have as wide an application outside of the face
recognition field as it might. Lee and Ranganath propose “a novel, pose-invariant face
recognition system based on a deformable, generic 3D face model” in [32] which is
comparable to the system described in [23] and according to the authors is capable of a
recognition success rate of 92.3% over a test data set of 660 images.

4.2 3D Reconstruction

In order to utilize 3D data in any of the recognition systems described above it is first necessary to obtain that data, which can come from a number of sources including hardware range finders or standard image capture devices. Using standard image capture devices our input is restricted to two dimensions; a number of techniques can, however, be used to analyze the image data in order to deduce 3D information.

Systems that utilise motion cues in order to directly reconstruct 3D data are in existence but
are not appropriate or accurate enough to handle the intricacies of the human face. [49]
describes such a system that utilises motion between image frames to simultaneously
segment and produce relative depth ordering of objects in a scene. Whilst the data available
from motion cues could potentially be useful in segmentation, feature extraction and layer
recovery it is not a suitable technique for the capture of face features and as such is of little
use for any 3D surface recognition systems except perhaps as a tool for initial segmentation
processes.

A traditional and much more common approach to 3D reconstruction is represented by a mass of stereo correspondence based reconstruction techniques. Image points are matched across stereo image pairs and then reconstructed to three dimensions. The most common
class of correspondence measures are pixel based algorithms [13, 28] which compare
similarity between pixels across images in order to deduce likely matching image points. The
problem of matching 2D camera projections of real world image points across stereo image
pairs leads to a host of additional issues including input point selection and “good” match
selection. Keller conducts a comprehensive evaluation of matching algorithms and match
quality measures in [27].
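A minimal sketch of the pixel-based idea, assuming rectified images so that matches lie on the same scanline; the intensity values, window size and disparity range are all illustrative:

```python
# Sum-of-squared-differences matching on a single rectified scanline:
# for a window centred at x in the left row, find the disparity whose
# window in the right row minimises the SSD score.

def ssd(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))

def match_pixel(left_row, right_row, x, half=2, max_disp=10):
    """Return the disparity (left x minus right x) with the lowest SSD."""
    window = left_row[x - half:x + half + 1]
    best_d, best_score = None, float("inf")
    for d in range(0, max_disp + 1):
        xr = x - d
        if xr - half < 0:
            break  # candidate window would fall off the image edge
        candidate = right_row[xr - half:xr + half + 1]
        score = ssd(window, candidate)
        if score < best_score:
            best_d, best_score = d, score
    return best_d

# The right row is the left row shifted by 3 pixels, so matching should
# recover a disparity of 3 at textured positions.
left  = [10, 10, 40, 90, 40, 10, 10, 10, 60, 80, 60, 10, 10, 10, 10]
right = left[3:] + [10, 10, 10]
print(match_pixel(left, right, x=9))  # 3
```

The same sketch also shows why intensity matching fails on textureless regions: at a position where the window is uniform, many disparities score equally well and the minimum is meaningless.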

As an alternative to pixel based correspondence measures, feature based approaches [21, 47] have also been considered. Here common face features are detected first and their relative
position used to calculate the head pose. Advanced vision techniques such as Gabor jet
features and bunch graph matching can be used to aid reconstruction [16]. Once the pose
has been calculated from a set of feature points it then becomes possible to synthesize the
face in any orientation. This kind of technique, when combined with a deformable model to
produce a much more accurate face description, results in some of the most accurate facial
reconstructions available to date.

A prerequisite for some reconstruction stages is camera calibration. This involves automatically calculating properties of the stereo camera rig. Several techniques have been proposed both for the mathematical techniques behind calibration (linear, non-linear and two-step methods) and for obtaining calibration data (from motion, calibration patterns or directly from a scene). Multiple stage calibration procedures which seek to minimise an error function over time are the current norm [5, 53]. Multiple images of a calibration pattern are captured and used as input into a constrained set of equations. An alternative to using a calibration pattern is to perform “on the job” calibration, whilst the reconstruction target object is being viewed. This fully automatic approach described by Maas [35] is only possible under multiple camera geometries, however it “can be considered a versatile and reliable method for the calibration of photo-grammetric systems” [35]. Since much of the accuracy of the overall system is dependent on the quality of the calibration it is essential that this stage of the reconstruction is accurately implemented.
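A common measure of calibration quality is reprojection error: known 3D points are projected through the estimated parameters and compared against their observed image positions. The sketch below assumes a purely linear pinhole model (no lens distortion) with hypothetical intrinsics:

```python
import math

# Pinhole projection and reprojection error. Intrinsics are illustrative:
# focal lengths fx, fy and principal point cx, cy, all in pixels.

def project(point3d, fx, fy, cx, cy):
    X, Y, Z = point3d  # camera-frame coordinates, Z > 0
    return (fx * X / Z + cx, fy * Y / Z + cy)

def reprojection_error(points3d, observed2d, fx, fy, cx, cy):
    """Root-mean-square pixel distance between projected and observed points."""
    total = 0.0
    for p, (u_obs, v_obs) in zip(points3d, observed2d):
        u, v = project(p, fx, fy, cx, cy)
        total += (u - u_obs) ** 2 + (v - v_obs) ** 2
    return math.sqrt(total / len(points3d))

pts = [(0.1, 0.0, 1.0), (0.0, 0.2, 2.0), (-0.1, -0.1, 1.5)]
obs = [project(p, 800, 800, 320, 240) for p in pts]  # noiseless observations
print(reprojection_error(pts, obs, 800, 800, 320, 240))  # 0.0
```

The iterative calibration procedures cited above minimise exactly this kind of error function over the full parameter set (including distortion terms the linear model omits).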

A number of research papers devoted to the development of stereo vision systems discuss
problems associated with the reconstruction process [37, 59]. Zu, Ku and Chen implement a system which utilises stereo camera input to seed an SSD intensity-based correspondence matching stage. They state that their results were “not optimal” [59] and contained high levels of noise in the model output, citing the correspondence algorithms as the underperforming system sub-section. McLauchlan suggests a recursive approach to tackling these problems and attempts to “develop algorithmic and statistical tools that combine data from multiple images” [37] in order to develop scene information over a given time window. He shows that the recursive approach to scene reconstruction increases system accuracy levels.
Other research by the same author details further reconstruction techniques which utilise data
over a number of image frames to recursively improve captured data [38-40].

4.3 Surface Estimation

In addition to the vast amount of literature available on the reconstruction of three dimensional
data a large amount of research has also gone into the development of algorithms to convert,
possibly incomplete, point cloud data produced by the earlier system stages into more
useable forms such as meshes or other 3D surfaces. One possible technique for
implementing this process is discussed in [14] where a technique using simulated annealing
to create an optimal surface mesh is implemented. Much more advanced techniques capable
of dealing with situations such as incomplete meshes or other errors are also available. An
example of one such technique is discussed in [12]. Here surfaces are represented
completely by polyharmonic radial basis functions (RBFs). Fast methods for fitting and evaluating RBFs have been developed which allow techniques such as this to be implemented quickly and efficiently; this type of representation also lends itself to the efficient processing of large data sets. Since we expect to be matching a large number of face points
it is possible that in the future a solution such as this for representing face models will be
required.

In addition to the recent advancements in mesh generation and surface reconstruction techniques, a number of algorithms developed some time ago are still proving useful. Convex hulls are an important topic in computational geometry and form the basis of a number of calculations relating to mesh construction. QuickHull is a widely used algorithm for computing the convex hull of a point set and is defined in greater detail in [7]. Delaunay triangulations are an example of a set of algorithms that have their mathematical basis in convex hull calculations. The Delaunay method works by subdividing the volume defined by the input point cloud into tetrahedra with the property that the circumsphere of every tetrahedron does not contain any other point of the triangulation. In addition to the method described here, constraints have been developed by various authors in order to improve triangulation accuracy and efficiency; Kallmann, Bier and Thalmann discuss algorithms for “the efficient insertion and removal of constraints in Delaunay Triangulations” in [26]. With the addition of a set of constraints, Delaunay triangulations are capable of generating meshes suitable for our surface requirements. Further to this description of the Delaunay method, Bourke provides an algorithm for efficient triangulation of irregularly spaced data points in [10]; Bourke's work has specific applications in terrain modelling but is based on the Delaunay method and as such has relevance to the general surface construction problem.
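In two dimensions the circumsphere (circumcircle) property reduces to a standard determinant test, sketched below with illustrative coordinates:

```python
# The Delaunay criterion in 2D: point d lies strictly inside the
# circumcircle of the counter-clockwise triangle (a, b, c) iff the
# determinant below is positive. A triangulation is Delaunay when no
# point passes this test against any triangle it does not belong to.

def in_circumcircle(a, b, c, d):
    ax, ay = a[0] - d[0], a[1] - d[1]
    bx, by = b[0] - d[0], b[1] - d[1]
    cx, cy = c[0] - d[0], c[1] - d[1]
    det = ((ax * ax + ay * ay) * (bx * cy - cx * by)
         - (bx * bx + by * by) * (ax * cy - cx * ay)
         + (cx * cx + cy * cy) * (ax * by - bx * ay))
    return det > 0

tri = ((0.0, 0.0), (2.0, 0.0), (1.0, 2.0))  # counter-clockwise
print(in_circumcircle(*tri, (1.0, 0.5)))    # True: inside, so an edge must flip
print(in_circumcircle(*tri, (5.0, 5.0)))    # False: well outside the circle
```

Incremental Delaunay algorithms apply this predicate after each point insertion, flipping edges of any triangle that fails it; the 3D tetrahedral case uses the analogous 4x4 circumsphere determinant.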


Another volumetric reconstruction method that has been researched and used effectively in past work is the marching cubes algorithm [33]. As with the Delaunay methods, marching cubes has been subjected to numerous modifications and algorithmic improvements [11, 50]. The basic form of the algorithm splits the dataspace into a series of sub-cubes. The eight sample points, known as voxels, that form each sub-cube are considered for triangulation. When one sub-cube is fully processed the algorithm moves (“marches”) on to the next sub-cube until a complete surface has been reconstructed. The original marching cubes technique “did not resolve ambiguous cases… resulting in spurious holes and surfaces in the surface representation for some datasets” [11], however several recently proposed improvements deal with such cases [11, 50, 51] in order to provide more complete surface reconstructions.
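The per-cube classification step can be sketched as follows; the full algorithm also needs the 256-entry triangle lookup table, which is omitted here. Threshold and sample values are illustrative:

```python
# The classification step of marching cubes: the eight voxel samples at a
# sub-cube's corners are compared against the iso-surface threshold to
# build an 8-bit case index (0-255). The index would then select, from a
# precomputed lookup table, which triangles the surface cuts through
# this cube.

def cube_index(corner_values, iso):
    """Return the 8-bit marching-cubes case index for one sub-cube."""
    index = 0
    for bit, value in enumerate(corner_values):
        if value < iso:  # corner lies inside the surface
            index |= 1 << bit
    return index

# All corners outside (case 0) or all inside (case 255) produce no
# triangles; mixed cases mean the surface passes through this cube.
print(cube_index([9, 9, 9, 9, 9, 9, 9, 9], iso=5))  # 0
print(cube_index([1, 1, 1, 1, 1, 1, 1, 1], iso=5))  # 255
print(cube_index([1, 9, 9, 9, 9, 9, 9, 9], iso=5))  # 1: only corner 0 inside
```

The ambiguous cases mentioned above arise precisely because some of the 256 indices admit more than one consistent triangle arrangement.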

4.4 Summary

The work described above covers the majority of the techniques required in the implementation of our stereo vision system. From calibration and correspondence to reconstruction, vast quantities of research have been carried out to achieve maximum performance and accuracy. Some stages of the reconstruction process are now considered solved. For example, the process of re-creating a set of points in three dimensions, once we have a suitable calibration and the locations of matching image points, is trivial and only requires manipulation of the appropriate projection equation. Other stages of the process are not so completely understood. The correspondence problem is widely regarded as the most error-prone stage of any reconstruction; intensity based measures tend to fail on natural images, whilst research into other methods, such as wavelet matching [9], has only been carried out recently and still requires additional development time before gaining acceptance in the computer vision community. A number of systems, both commercial and research based, are available to provide almost fully autonomous face scanning to a reasonable degree of accuracy. A couple of research-based face recognition systems which utilise 3D scanning techniques as a means of input to a recognition sub-system have very recently been the subject of much interest, and it is very likely that over the next year we will start to see commercial implementations of such systems.


5 System Outline

In this paper we consider acquiring stereo image pairs from a set of automatically calibrated
CCTV cameras, investigating various stereo correspondence algorithms and reconstructing
the three dimensional data to produce a fully textured model of the face. Initially the 3D data
will be used to directly reconstruct the face rather than using any deformable model
techniques, although this may be a consideration for future work. The basic system outline is
detailed in Figure 1.

Figure 1: High level outline of the reconstruction system.

Initially the system has no knowledge of either the type or position of the cameras which will
be used as input to the system. The first task, therefore, is to calibrate the cameras. This
involves obtaining a set of internal camera parameters followed by a set of external
parameters. Once calibration is complete we have a set of parameters which completely
define the setup of the camera rig. The calibration parameters are used at later stages of the
reconstruction process, both to optimise the efficiency of correspondence matching and
during the actual reconstruction calculations.


Following calibration the next stage is to capture the input images of the data we are trying to
reconstruct and find corresponding points between the two images. Once we have a set of
matching points we can use the calibration data along with the point match data to calculate
the 3D position of each of the points for which we have a match. At this stage in the process
we have an unorganised data cloud of 3D points. In order to make the data more useful the
point cloud is transformed into a meshed surface representing the original object we are trying
to reconstruct. The final task of the reconstruction routines is to map the newly created
surface with a texture lifted from the original input images to give the reconstruction a sense
of realism. Once this stage has been reached we should be left with a surface that accurately
represents the surface we were trying to reconstruct.


6 Calibration
In order for it to be possible to reconstruct a scene from a stereo image pair it is necessary
that several important properties about each of the cameras be known. Obtaining values for
these properties is known as calibration. Techniques for camera calibration loosely fall into
three categories: linear, non-linear and two-step. Linear techniques assume a simple pinhole
camera model [52] and do not account for lens distortion effects, which turn out to be
“significant in most off-the-shelf charge coupled devices” [41]. In non-linear methods a
relationship between parameters is established and an iterative solution is then calculated
through minimisation. Many early vision systems used this non-linear technique and have
since been modified to take camera lens distortions into account; however, in order for the
minimisation to function correctly, good initial estimates of the camera parameters must be
made to avoid converging to an incorrect solution. Finally, two-step techniques use a
combination of linear and non-linear methods to find a direct solution for some parameters and
iteratively estimate others. This final approach is the most commonly implemented solution at
present.

6.1 Intrinsic and Extrinsic Parameters

In order for reconstruction to take place we effectively need to be able to translate from the
image co-ordinates as seen by the camera system into real world 3D co-ordinates. The co-
ordinate systems that we need to translate between are related by two sets of parameters,
intrinsic and extrinsic. Camera calibration is an optimisation process in which the error
between observed image features and their theoretical positions is minimised with respect to
these parameters. The intrinsic parameters are determined by the optical and digital sensors
in each camera. These parameters determine the perspective projection of a three-dimensional
point onto a two-dimensional image plane. The required variables for each camera are the
focal length, the effective pixel width and height and the principal point of the camera. The
extrinsic camera parameters consist of a 3×3 orthogonal rotation matrix and a translation
vector describing the transformation required to move from one co-ordinate system to the
other. Essentially calibration is the process of calculating two matrices which fully represent
both the internal and external parameters of the cameras being calibrated. The two matrices
which require calculation take the following forms:

$M_{int} = \begin{pmatrix} f s_x & 0 & o_x \\ 0 & f s_y & o_y \\ 0 & 0 & 1 \end{pmatrix} \qquad M_{ext} = \begin{pmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{pmatrix}$
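To make the role of the two matrices concrete, the projection of a world point through $M_{ext}$ and then $M_{int}$ can be sketched as follows (an illustrative sketch only; the function name and matrix values are our own placeholders, not the system's calibration results):

```python
import numpy as np

def project_point(M_int, M_ext, X_world):
    """Project a 3D world point to pixel co-ordinates using the
    intrinsic (3x3) and extrinsic (3x4) calibration matrices."""
    X_h = np.append(X_world, 1.0)          # homogeneous world point
    p = M_int @ (M_ext @ X_h)              # 3-vector, defined up to scale
    return p[:2] / p[2]                    # perspective divide

# Illustrative values only (not the paper's calibration results):
M_int = np.array([[800.0, 0.0, 320.0],
                  [0.0, 800.0, 240.0],
                  [0.0,   0.0,   1.0]])
M_ext = np.hstack([np.eye(3), np.zeros((3, 1))])

# A point on the optical axis projects to the principal point.
print(project_point(M_int, M_ext, np.array([0.0, 0.0, 5.0])))  # → [320. 240.]
```

Note how the perspective divide makes the arbitrary scale of the homogeneous 3-vector irrelevant, which is why calibration only ever needs to determine the matrices up to the forms shown above.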

In order to obtain these values in a straightforward manner it is possible to utilise a calibration
pattern with known control point positions. Whilst there are a number of possible calibration


patterns, the simplest to use is a planar pattern (a chessboard). When all the chessboard's
squares are visible to both cameras it is possible to capture a stereo image pair, analyse both
images with a corner detector and begin making deductions about the camera parameters.
This is possible since we know the physical properties of the chessboard and can deduce its
position and rotation entirely from the relative locations of the internal board corners (the
control points). Utilising this data from both cameras it then becomes possible, over a
sequence of frames, to further deduce the relative position and rotation of the two cameras
(and hence the extrinsic parameters) as well as the focal length, pixel width, height and
principal point (and hence the intrinsic parameters). Once this process is finished the
calibration stage is complete and the obtained data can be reused for as long as the camera
setup remains the same.

6.2 Parameter Estimation

Our implementation of the calibration stage of the system relies heavily on a number of
functions found in the Intel OpenCV image library. A multi-plane approach to calibration,
drawing on techniques from both photogrammetric and self-calibration approaches, is
described below and forms the basis of our calibration functionality. A planar calibration
pattern is used since, unlike the alternatives, it can simply be printed on standard paper and
fixed to a rigid object rather than requiring a more complicated construction. Initially, multiple
images of the calibration pattern are captured and control points in each of the images are
found. The algorithm in question is based on a homography, which maps points on one plane
to points on another plane using a linear transformation. The following description of the
calibration routines is based on that of [55]. To begin we must consider the following
definitions:

$m = [u, v]^T$, $\tilde{m} = [u, v, 1]^T$
$M = [x, y, z]^T$, $\tilde{M} = [x, y, z, 1]^T$
$A = M_{int}$

Where $m$ is a 2D image co-ordinate, $M$ a 3D world co-ordinate and $s$ some arbitrary scale factor. The camera's projection equation can therefore be written as:

$s\tilde{m} = A[R, T]\tilde{M}$

This approach is based on the first fundamental theorem of projective geometry which states
“There exists a unique homography that performs a change of basis between two projective
spaces of the same dimension” [17]. Thus given any plane in world space, there is a
mapping between the plane and any additional images of it. This mapping is defined up to
the scale factor s and can be derived through the expansion of the camera projection
equation detailed above. Expanding the projection equation gives us:


$s(u\ v\ 1)^T = H(x\ y\ 1)^T$

and thus, a point on the world plane is mapped to the image plane with:

$s\tilde{m} = H\tilde{M}$
H in the above equation is a homography. Homographies can be estimated from four point
correspondences, and over a sequence of frames it becomes possible to build a system of
homogeneous equations from which we can estimate both the intrinsic and extrinsic
parameters of the cameras and hence calculate our calibration matrices for later use.
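One standard way to estimate a homography from four point correspondences is the direct linear transform (DLT). The following is an illustrative sketch with made-up values, not the system's own code:

```python
import numpy as np

def estimate_homography(src, dst):
    """Estimate the 3x3 homography H with dst ~ H @ src from four (or
    more) point correspondences, using the direct linear transform."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.array(rows, dtype=float)
    _, _, Vt = np.linalg.svd(A)            # null vector = smallest singular vector
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]                     # fix the arbitrary scale factor s

# Example with an arbitrary (made-up) plane-to-image mapping:
H_true = np.array([[1.2, 0.1, 5.0],
                   [-0.05, 0.9, -3.0],
                   [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
dst_h = (H_true @ np.c_[src, np.ones(4)].T).T
dst = dst_h[:, :2] / dst_h[:, 2:]
H_est = estimate_homography(src, dst)
```

With exact correspondences in general position the SVD null vector recovers H exactly up to scale; with noisy chessboard corners the same construction gives the least squares estimate.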

Combining the homography with the projection equation (for points on the plane z = 0) gives:

$(h_1\ h_2\ h_3) = sA(r_1\ r_2\ t)$

Re-writing the homography equation in column form:

$h_1 = sAr_1$
$h_2 = sAr_2$
$h_3 = sAt$

Some basic constraints on the camera parameters can now be stated, following from the orthonormality of the rotation matrix columns:

$r_i^T r_j = 0 \; (i \neq j)$
$r_i^T r_i = r_j^T r_j$

It then becomes possible to derive two basic constraints on the parameters we are trying to
obtain:

$h_1 = sAr_1 \;\Rightarrow\; r_1 = \frac{1}{s}A^{-1}h_1, \qquad r_2 = \frac{1}{s}A^{-1}h_2$

From $r_1^T r_2 = 0$:

$h_1^T A^{-T} A^{-1} h_2 = 0$

From $r_1^T r_1 = r_2^T r_2$:

$h_1^T A^{-T} A^{-1} h_1 = h_2^T A^{-T} A^{-1} h_2$


The two constraints are represented by $h_1^T A^{-T} A^{-1} h_2 = 0$ and
$h_1^T A^{-T} A^{-1} h_1 = h_2^T A^{-T} A^{-1} h_2$. Two constraints are required since there are 6 degrees of
freedom for the extrinsic parameters. For each known homography we can therefore obtain
two constraints on the five intrinsic parameters. Hence three or more homographies are
required to determine the intrinsic parameters.

A closed form solution of the camera calibration is therefore obtained by setting $B = A^{-T} A^{-1}$.

Since B is symmetric it can be written as a vector b containing 6 parameters,
$b = [B_{11}, B_{12}, B_{22}, B_{13}, B_{23}, B_{33}]^T$, and the two constraints on the intrinsic
parameters can be used to build a system of homogeneous equations:

$h_i^T B h_j = v_{ij}^T b$

$\begin{pmatrix} v_{12}^T \\ (v_{11} - v_{22})^T \end{pmatrix} b = 0$

For each image in the calibration sequence it is possible to stack a corresponding pair of
equations into the system above and thus solve for b. Once b has been obtained we can
solve for the intrinsic parameters.
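The stacking-and-solving step can be sketched as follows. This is our own illustrative code with made-up camera values; the recovery of the intrinsics from b uses the standard closed-form expressions associated with this formulation, and pixel co-ordinates are pre-scaled purely for numerical conditioning:

```python
import numpy as np

def v_ij(H, i, j):
    """Row vector with h_i^T B h_j = v_ij . b, where b packs the symmetric
    B = A^-T A^-1 as [B11, B12, B22, B13, B23, B33]."""
    hi, hj = H[:, i], H[:, j]
    return np.array([hi[0]*hj[0],
                     hi[0]*hj[1] + hi[1]*hj[0],
                     hi[1]*hj[1],
                     hi[2]*hj[0] + hi[0]*hj[2],
                     hi[2]*hj[1] + hi[1]*hj[2],
                     hi[2]*hj[2]])

def intrinsics_from_homographies(Hs, scale=1000.0):
    """Recover the intrinsic matrix A from three or more plane homographies
    by stacking the two constraints per homography and solving V b = 0."""
    N = np.diag([1.0 / scale, 1.0 / scale, 1.0])   # co-ordinate normalisation
    V = np.vstack([np.vstack([v_ij(N @ H, 0, 1),
                              v_ij(N @ H, 0, 0) - v_ij(N @ H, 1, 1)])
                   for H in Hs])
    b = np.linalg.svd(V)[2][-1]                    # null vector of the system
    B11, B12, B22, B13, B23, B33 = b
    v0 = (B12 * B13 - B11 * B23) / (B11 * B22 - B12 ** 2)
    lam = B33 - (B13 ** 2 + v0 * (B12 * B13 - B11 * B23)) / B11
    alpha = np.sqrt(lam / B11)
    beta = np.sqrt(lam * B11 / (B11 * B22 - B12 ** 2))
    gamma = -B12 * alpha ** 2 * beta / lam
    u0 = gamma * v0 / beta - B13 * alpha ** 2 / lam
    An = np.array([[alpha, gamma, u0], [0.0, beta, v0], [0.0, 0.0, 1.0]])
    return np.linalg.inv(N) @ An                   # undo the normalisation

def rot(ax, ay, az):
    """Euler-angle rotation matrix (only used to synthesise test data)."""
    cx, sx, cy, sy, cz, sz = (np.cos(ax), np.sin(ax), np.cos(ay),
                              np.sin(ay), np.cos(az), np.sin(az))
    Rx = np.array([[1, 0, 0], [0, cx, -sx], [0, sx, cx]])
    Ry = np.array([[cy, 0, sy], [0, 1, 0], [-sy, 0, cy]])
    Rz = np.array([[cz, -sz, 0], [sz, cz, 0], [0, 0, 1]])
    return Rz @ Ry @ Rx

# Synthetic check with made-up intrinsics: build exact homographies
# H = A (r1 r2 t) for four plane poses, then recover A.
A_true = np.array([[800.0, 0.0, 320.0], [0.0, 810.0, 240.0], [0.0, 0.0, 1.0]])
poses = [(0.3, 0.1, 0.05, [0.1, 0.2, 5.0]),
         (-0.2, 0.4, -0.1, [-0.3, 0.1, 6.0]),
         (0.15, -0.3, 0.2, [0.2, -0.2, 4.0]),
         (-0.25, -0.15, 0.3, [0.0, 0.3, 5.5])]
Hs = [A_true @ np.c_[rot(ax, ay, az)[:, :2], t] for ax, ay, az, t in poses]
A_est = intrinsics_from_homographies(Hs)
```

With exact synthetic homographies the recovered matrix matches the true intrinsics to machine precision; with real chessboard data the same estimate would normally be refined by a subsequent non-linear minimisation.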

6.3 Calibration Testing

Figure 2 shows the calibration process in progress. The left and right cameras in the stereo
rig capture simultaneous input of the calibration pattern. The two images then undergo
thresholding in order to deduce the chessboard corner positions. As can be seen in Figure
2, the corners of each square on the board are marked and matched to the corresponding
corner on the opposite input image. The order of the points as they appear in both images is
used as a constraint to ensure that each square corner is matched correctly.


Figure 2: Calibration Dialog Screenshot

In order to test the calibration algorithms it is necessary to attempt a calibration on a series of
images for which the calibration results are already known, so that our calculations can be
compared for accuracy. Since it is difficult to obtain actual data regarding intrinsic parameters,
and similarly difficult to make precise measurements relating the positions of the two cameras
in the stereo rig, the calibration routines were tested using a sequence of synthesized images
for which the actual calibration parameters were known.

The results obtained from the test calibration sequence are shown in Table 1. Since the
cameras are virtual, and the left camera is an exact copy of the right camera, it can be
assumed that the intrinsic parameters of both cameras will be identical. It can be seen from
the results that the calibration procedure produces almost identical values for left and right
camera intrinsic parameters, and it can therefore be assumed that the automatic detection of
these parameters is being performed correctly.

Table 1: Calibration Test Sequence Results

              Left Camera                      Right Camera
Intrinsic     771.642    0        322.097      770.628    0        322.251
Parameters      0      772.772    239.779        0      773.177    239.432
                0        0          1            0        0          1
Translation   -15.984  -14.67     66.785         3.3    -14.635    66.955
Rotation       0.961   -0.019      0.276        0.961   -0.02       0.277
              -0.026    0.987      0.159       -0.026    0.987      0.161
              -0.276   -0.16       0.948       -0.276   -0.162      0.947
Since the calibration can be viewed as a process of constrained optimisation over a sequence
of input images, the slight errors in the calculations can be assumed to be a result of either
discrepancies in the optimisation, inaccuracies in the chessboard corner finding algorithms,
or, since the input images were rendered using JPEG compression, image quality
degradation. Despite these minor inaccuracies the calculation of intrinsic parameters appears
to be correct.


The geometry of the test camera rig was such that both cameras lay on the same position of
the vertical axis and were positioned the same distance from the target object (equal positions
on the z axis). The only displacement between camera positions was on the horizontal axis.
Figure 3 shows a graphical representation of the camera rig.

Figure 3: Artificial Test Rig Camera Configuration

Under these conditions it can be seen that the extrinsic parameters automatically calculated
for this rig are correct. The calculated translation vector indicates that the only translation
between the cameras is on the horizontal axis. Furthermore, it should be noted that the
original horizontal axis displacement of the two cameras was 20 units, whereas the calibration
finds this displacement to be 19.284 units. This demonstrates a satisfactory level of accuracy
throughout the calibration process; however, appropriate tests and research should be carried
out regarding non-parallel camera geometries and which geometries have the most positive
effect on the reconstruction results. The calibration process was also tested using camera-
captured images, and returned results closely approximating those that would be expected
(i.e. a logically correct ratio between x, y and z displacements and approximately correct
rotation matrices); however, no precise physical measurements of our stereo rig were
possible and hence it proved difficult to assess performance on real calibrations with no
ground-truth results for comparison. Despite this, the observable evidence and results
suggest that the calibration procedures detailed in this section function in a satisfactory
manner.


7 Rectification

Implementations of the correlation algorithms required in the next stage of the reconstruction
process can be greatly simplified if the input images are rectified. The process of
rectification involves a 2D transformation of the input images such that corresponding
image points are located on equivalent image scan lines. Utilising geometric properties
inherent to epipolar geometry (see Figure 4), given a point and its projected location on one
image plane, it is possible to calculate on which epipolar line in the other image plane the
point will appear. This epipolar constraint allows us to calculate and perform the rectifying
2D transformation of the original input images.

Figure 4: Graphical representation of epipolar geometry.

The epipolar constraint expresses the relation between two images of the same scene. The
plane through COP1, COP2 and P, shown in Figure 4, is the epipolar plane. The
intersections of this plane with the two image planes are the epipolar lines.
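To make the epipolar constraint concrete, the following sketch (our own illustration with made-up calibration values, not the system's code) builds the fundamental matrix for a calibrated pair and checks that a matched point pair satisfies the constraint:

```python
import numpy as np

def fundamental_from_calibration(K1, K2, R, t):
    """F = K2^-T [t]x R K1^-1 for cameras P1 = K1[I|0], P2 = K2[R|t];
    matched homogeneous pixels satisfy x2^T F x1 = 0."""
    tx = np.array([[0.0, -t[2], t[1]],
                   [t[2], 0.0, -t[0]],
                   [-t[1], t[0], 0.0]])   # cross-product matrix of t
    return np.linalg.inv(K2).T @ tx @ R @ np.linalg.inv(K1)

# A parallel (rectified-style) rig: identical cameras, pure x translation.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([-0.2, 0.0, 0.0])
F = fundamental_from_calibration(K, K, R, t)

# Project a world point into both images and test the constraint.
X = np.array([0.3, -0.1, 4.0])
x1 = K @ X;            x1 /= x1[2]
x2 = K @ (R @ X + t);  x2 /= x2[2]
epipolar_line = F @ x1   # line in image 2 on which x2 must lie
```

For this parallel geometry the epipolar lines come out horizontal, which is exactly the property that rectification establishes for a general camera pair.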

Figure 5: A Rectified Input Image Pair


The effect of the rectification is that the correspondence problem is reduced to one
dimension, since we only have to search for matching points across a single horizontal line of
the matching input image. Figure 5 shows the results of rectifying a pair of input images after
calibration of a stereo rig and the capture of a stereo image pair. Analysis of the rectified
image pairs shows that matching points are indeed positioned on matching scan lines,
confirming that the rectification is valid. With the rectification of the input images complete it
then becomes possible to begin the remaining stages of the reconstruction process.


8 Correlation
In order to calculate the depth of a point in the scene we have to find points in both the left
and right camera images which represent the same real world co-ordinate. Perhaps the most
important factor contributing to the accuracy of the final reconstruction is a system's ability to
find a comprehensive solution to this problem. There is a vast array of available correlation
algorithms, including local window based methods [36] and feature point based techniques
[2]. A number of the other available methods for matching points between images are
discussed by Laganiere and Vincent in [31]. Since initially we do not know where we might
find a correlating image point, the search space for matching a point is relatively large. In
order to constrain the size of the search, the left and right camera images can be rectified.
This process involves rotating the left and right camera images in accordance with
parameters obtained at the calibration stage. The rectification of the images ensures that
matching points can be found on identical raster scan lines in both images. This greatly
improves the performance of the point matching algorithms, since the correlation search
space is reduced to one dimension.

8.1 Input Point Detection

Before we consider correlation, it should be clear that we need to select a set of feature
points which we are going to attempt to match. This is not as trivial a problem as it might first
seem: a good selection of input points makes finding correspondences easier. The term
used to describe the suitability of a point for matching is its saliency. Hall, Leibe and
Schiele state that the saliency of an image feature can be defined to be “inversely
proportional to the probability of occurrence of that image feature” [22]. These authors
give a formal definition of saliency in the early part of their paper and go on to
show that good candidate input points are usually those with high saliency. Sebe and Lew
provide a good comparison of a number of salient point detectors in [46], including the
Haar feature detector, the Harris feature detector, random point selection and others. The
authors also propose a method based on analysis of the image using wavelet decomposition.
A number of these feature point selectors will be implemented and compared within our
vision system.
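As an illustration of one detector from this comparison, a minimal Harris corner response can be sketched as follows (our own simplified code, not the detector implementation used in the cited work):

```python
import numpy as np

def harris_response(img, k=0.04, win=2):
    """Harris corner response R = det(M) - k*trace(M)^2, with M the
    image structure tensor summed over a (2*win+1) square window."""
    Iy, Ix = np.gradient(img.astype(float))
    Ixx, Iyy, Ixy = Ix * Ix, Iy * Iy, Ix * Iy
    R = np.zeros_like(img, dtype=float)
    for r in range(win, img.shape[0] - win):
        for c in range(win, img.shape[1] - win):
            sl = (slice(r - win, r + win + 1), slice(c - win, c + win + 1))
            sxx, syy, sxy = Ixx[sl].sum(), Iyy[sl].sum(), Ixy[sl].sum()
            det = sxx * syy - sxy * sxy
            R[r, c] = det - k * (sxx + syy) ** 2
    return R

# A single synthetic corner: a bright quadrant in a dark image.
img = np.zeros((20, 20))
img[10:, 10:] = 1.0
R = harris_response(img)
peak = np.unravel_index(np.argmax(R), R.shape)
```

Pixels on a straight edge have a rank-deficient structure tensor and score low, while the corner, where the gradient varies in both directions, produces the peak response; this is one way of making the high-saliency intuition above operational.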

8.2 Intensity Based Pixel-wise Correlation

The simplest correlation algorithms rely on local window based matching techniques which
consider the “similarity” of a local window surrounding a potential point correlation. The
easiest similarity measure to implement would be one that, given a pixel to match, simply tries
to find a pixel with the same colour and intensity in the matching input image. This technique
will obviously have problems when there are one or more other pixels with similar or identical
properties in the match search space. An advance on this technique is to also consider the


values of the pixels surrounding the pixel that we are trying to match; in this manner we should
be able to differentiate between the pixel we are looking for and pixels which are simply similar
in colour and intensity. Pixel-wise image correspondence methods were among the first used
to attempt to solve the stereo correspondence problem.

Initially, two different intensity based correlation matching algorithms were considered. The
first is the sum of squared differences (SSD) similarity measure, which calculates the
differences between pixels in an image window on each half of the image pair and then sums
the squares of these differences to decide how similar two given regions are. The second
algorithm investigated is the zero mean normalized cross correlation (ZMNCC) algorithm,
which attempts to compensate for differences in average intensity across image pairs whilst
calculating matching points.

8.3 SSD

The SSD algorithm is defined as follows:

$C_{SSD}(i, j, d) = \sum_{k=-W}^{W} \sum_{l=-W}^{W} [I_l(i+k, j+l) - I_r(i+k, j+l-d)]^2$

where (2W+1) is the width of the correlation window, $I_l$ and $I_r$ are the intensities of the left and
right image pixels, $[i, j]$ are the co-ordinates of the left image pixel, $d$ is the relative
displacement (disparity) between the left and right image pixels and $C_{SSD}$ is the SSD
correlation function; the best match is the displacement minimising $C_{SSD}$.

This algorithm assumes that correlating image points will be surrounded by windows of image
points whose pixel-by-pixel differences, when squared and summed, measure the similarity of
the two points at the centre of each window.
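The SSD search along a rectified scan line can be sketched as follows (our own illustrative code, using a synthetic image pair whose true disparity is 3 pixels everywhere):

```python
import numpy as np

def ssd(win_l, win_r):
    """Sum of squared differences between two equally-sized windows."""
    d = win_l.astype(float) - win_r.astype(float)
    return np.sum(d * d)

def match_pixel_ssd(left, right, i, j, max_disp, W=1):
    """Find the disparity d in [0, max_disp] minimising the SSD between
    the (2W+1)x(2W+1) window around left[i, j] and the window around
    right[i, j - d] on the same (rectified) scan line."""
    win_l = left[i - W:i + W + 1, j - W:j + W + 1]
    best_d, best_cost = 0, np.inf
    for d in range(0, max_disp + 1):
        if j - d - W < 0:                  # window would leave the image
            break
        win_r = right[i - W:i + W + 1, j - d - W:j - d + W + 1]
        cost = ssd(win_l, win_r)
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d

# Synthetic rectified pair: the right image is the left shifted 3px left,
# so every pixel has a known ground-truth disparity of 3.
rng = np.random.default_rng(0)
left = rng.random((12, 20))
right = np.roll(left, -3, axis=1)
```

Because the pair is rectified, the candidate loop only moves along one scan line, which is precisely the one-dimensional search space reduction described above.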

8.4 ZMNCC

The alternative to the SSD similarity measure considered here is the Zero Mean Normalized
Cross Correlation algorithm, defined as follows:

$C_{ZMNCC}(f_l, f_r) = \dfrac{(f_l - \bar{f}_l) \cdot (f_r - \bar{f}_r)}{\|f_l - \bar{f}_l\| \, \|f_r - \bar{f}_r\|}$


Here $f_l$ and $f_r$ represent vectors containing the intensity levels of the pixels in the left and right
correlation windows, and $\bar{f}_l$ and $\bar{f}_r$ their respective mean intensities.

The ZMNCC algorithm subtracts the average intensity of each correlation window from the
pixels within that window before computing point similarity from the intensity vectors. This is
an attempt to compensate for consistent changes in intensity surrounding points that may
occur between images in a stereo pair due to scene illumination, light source direction or a
number of additional factors.
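A matching sketch for the ZMNCC measure (again our own illustration, not the system's code) shows the property that motivates it, invariance to a constant intensity offset between the images:

```python
import numpy as np

def zmncc(win_l, win_r):
    """Zero mean normalized cross correlation between two windows:
    +1 is a perfect match; values near 0 or below indicate a poor match."""
    fl = win_l.astype(float).ravel() - win_l.mean()
    fr = win_r.astype(float).ravel() - win_r.mean()
    denom = np.linalg.norm(fl) * np.linalg.norm(fr)
    return float(fl @ fr / denom) if denom > 0 else 0.0

# The same patch under brighter illumination still correlates perfectly,
# whereas an unrelated patch does not.
rng = np.random.default_rng(1)
win = rng.random((5, 5))
brighter = win + 0.3                      # constant illumination offset
other = rng.random((5, 5))                # an unrelated patch
```

An SSD comparison between `win` and `brighter` would report a large error despite the patches depicting the same surface; the zero mean normalisation removes exactly this failure mode.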

8.5 Correspondence Testing

In order to test the performance of the two algorithms a common stereo image pair was
selected. One of the input images from the pair is shown to the left of Figure 6. The actual
disparity map of the scene, which was computed by a laser range finder, is shown on the right
of the same figure.

The difference in position between corresponding points can be used to represent the
intensity of a pixel at any location in order to produce a disparity map representing the depth
of objects in a scene. Whilst this is not a full 3D reconstruction it is the first step and the
quality of a disparity map is usually representative of the effectiveness of a point matching
algorithm. The output of the SSD similarity measure and the ZMNCC algorithm is shown in
Figure 7.

Figure 6: Stereo pair input image (left) and ground truth disparity data (right).


As expected, neither of the algorithms produces perfect results. A fairly large number of points
are incorrectly matched to their corresponding points. This is especially apparent around the
edges of objects and in areas of similar or low texture. Reasons for these errors include
insufficient differences between image window intensities, illumination differences, image
noise and occluded points. Both algorithms do, however, produce recognisable output, and
the depths of the majority of the scene's objects can be observed in the resultant disparity
maps.

Figure 7: SSD (left) and ZMNCC (right) disparity maps.

Furthermore, in order to produce the disparity maps an attempt was made to match every
single pixel in one image to an appropriate pixel in the opposite image. Matching every pixel
will not be a priority when we are attempting to reconstruct the surface of the face; instead we
need only select strong candidates for a match and then interpolate depths between these
matched points. This should give us a higher rate of accuracy during correspondence than is
evident in the disparity maps of Figure 7.
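For instance, given a sparse set of confident matches along one scan line, intermediate disparities can be filled in by simple linear interpolation (an illustrative sketch with made-up values, not the system's actual interpolation scheme):

```python
import numpy as np

# Sparse, confidently matched columns on one scan line and their disparities.
matched_cols = np.array([2, 7, 13, 18])
matched_disp = np.array([4.0, 5.0, 8.0, 8.5])

# Densify: interpolate a disparity for every column between the matches.
cols = np.arange(2, 19)
dense_disp = np.interp(cols, matched_cols, matched_disp)
```

Since disparity varies smoothly over surfaces like the face (the continuity constraint discussed below), interpolating between strong matches is often more reliable than forcing a weak match at every pixel.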


8.6 Matching Constraints

It can be seen from the results that, whilst in simple cases pixel-wise correspondence
matching is capable of finding corresponding points, we can expect a number of erroneous
correspondences to be found. Depending on the severity of the mismatches this could have
drastic effects on our final model if we are not able to discern which of our correspondences
are likely to be correct and which are likely to be errors. In order to make this decision it is
possible to impose a number of constraints on the matches in order to eliminate erroneous
points. The constraints likely to have the most positive effect are those that make
assumptions about the world we are viewing and hence are able to determine which matches
violate properties we expect to observe in the data.

Constraints appropriate for pixel-wise correspondence techniques include the following:

• Similarity: For intensity based approaches, the similarity of two corresponding points
is completely defined by some measure of how similar a set of pixel intensities are. It
is possible to eliminate some weak matches by specifying a minimum match strength
threshold under which matches are marked as invalid.
• Uniqueness: Almost always a given pixel should match at most one corresponding
pixel in the match image. Occluded points and partially transparent objects can
violate this constraint.
• Continuity: A property of most natural objects (including the human face) is that the
disparity of matches should vary smoothly over the object.
• Ordering: The order of points on the original image will, almost always, be preserved
in the matching image. This constraint fails when points lie in what is known as “the
forbidden zone.”
• Left / Right Consistency: A point that is matched from the left to the right image
should be in the same location if the point is matched from the right to the left image.
• Statistical: Assuming a certain distribution for the reconstructed points can help to
remove false matches. For example, removing points that fall outside the standard
deviation of the point cloud can help eliminate spurious matches if we expect the
points in the cloud to be normally distributed.
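The left/right consistency constraint, for example, can be sketched as follows (illustrative code with a toy disparity field, not the system's implementation):

```python
import numpy as np

def lr_consistency_mask(disp_l, disp_r, tol=1):
    """Left/right consistency check: a left-image disparity d at column j
    is kept only if matching back from the right image agrees, i.e.
    |disp_l[i, j] - disp_r[i, j - d]| <= tol. Returns a boolean mask."""
    h, w = disp_l.shape
    mask = np.zeros((h, w), dtype=bool)
    for i in range(h):
        for j in range(w):
            d = int(disp_l[i, j])
            if 0 <= j - d < w:
                mask[i, j] = abs(disp_l[i, j] - disp_r[i, j - d]) <= tol
    return mask

# Tiny example: a consistent disparity field of 2, with one bad match.
disp_l = np.full((3, 6), 2)
disp_r = np.full((3, 6), 2)
disp_l[1, 4] = 4                          # spurious left-to-right match
mask = lr_consistency_mask(disp_l, disp_r, tol=0)
```

The spurious match maps back to a column whose right-to-left disparity disagrees, so it is rejected while the consistent matches survive.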

8.7 Constraint Testing

It should be noted that under certain conditions any or all of these constraints can fail to
eliminate incorrect matches and / or can eliminate correct matches. However, it has been
shown that the introduction of constraints into a system does serve to reduce the number of
incorrectly matched points. A disadvantage of combining many constraints is that this
“results in more thresholds, and thus a greater need for tuning” [31]. This does remove a


degree of automation from the process; however, it appears to be an essential step in the
absence of any perfect matching strategy. Figure 8 shows the potential importance of
implementing constraints. The left image shows the results of a reconstruction with no
constraints applied, whereas the right image demonstrates the improvement to the point
cloud when constraints are applied. The initial correspondence match produces a
couple of matching errors, and the reconstruction of the 3D points (discussed in the following
section) shows a group of points to one side of the reconstruction and a pair of mismatched
points a long distance from the actual model. This sort of matching error leads to the
production of an unsuitable model; however, with the addition of some simple constraints to
the matching process a much improved starting point for model generation is created.

Figure 8: Effects of constraint application

The development of the constraints further demonstrates that the process of stereo
reconstruction is one of constrained optimisation. We can never produce perfect calibrations,
point matches and reconstructions from actual imagery, and hence the best results are
produced when we can constrain our results to such an extent that the majority of errors are
eliminated, or reduced to the point that they do not produce observable errors in our final
output.

8.8 Alternative Correspondence Measures

Despite the introduction of constraints to the matching process, the inaccuracies of the
algorithms discussed so far will have a fairly major impact on the final output of the system.
The difficulty of finding correct correlations in real imagery is probably a problem which
intensity based matching algorithms cannot fully overcome. Differences between stereo
image pairs due to illumination, point occlusion and image noise can be such that matching
points based on the intensity of a window surrounding those points cannot yield a high rate of
accuracy. Solving this problem has been the subject of much research, with the most
promising results coming from investigations into the use of wavelet decomposition of the
input images in order to find both points of interest and the corresponding point matches;
[9, 48] discuss this approach in more detail. Their research is based around multiscale matching


techniques which decompose an image into several parts containing copies of the
image under certain scale changes and Gaussian filtering [44]. Utilising a dyadic wavelet
transform as the basis for a matching algorithm, Bhatti claims to demonstrate “the viability of
applying the wavelet transform approach to stereo matching” [9]. As such this appears to be
a promising direction in which to take future development of more robust, automatic and
accurate correspondence matching algorithms.


9 Projective Reconstruction

The initial task of the reconstruction stage is to calculate the 3D positions of the correlated
image points. This calculation utilizes data from all the previous stages: the cameras'
internal and external parameters are used along with a vector of co-ordinates matching
corresponding points between the left and right input images. In order to recover the x, y and
z real world co-ordinates an over-constrained set of linear equations must be solved. Since
the system of equations is over-constrained it is necessary to obtain a best fit least squares
estimate of the results. The following calculations demonstrate the algebraic reconstruction
of the corresponding image points:

Let $(c'_l\ r'_l\ 1)^T$ and $(c'_r\ r'_r\ 1)^T$ represent the corresponding image points on the left
and right rectified input images. The original non-rectified image co-ordinates can then be
recovered from the rectified matching co-ordinates:

$(c'_l\ r'_l\ 1)^T = W'_l R_{rect} W_l^{-1} (c_l\ r_l\ 1)^T$
$(c'_r\ r'_r\ 1)^T = W'_r R_{rect} W_r^{-1} (c_r\ r_r\ 1)^T$

Where $W$ is the 3×3 matrix representing the left or right camera's intrinsic parameters ($M_{int}$ in
the calibration section) and $M$ represents the camera's extrinsic parameters ($M_{ext}$ in the
calibration section).

The perspective projection equation is defined as:

$\lambda (c\ r\ 1)^T = W M (x\ y\ z\ 1)^T$

The combination of the above equations then yields the following:

$\lambda_l (c'_l\ r'_l\ 1)^T = W'_l R_{rect} W_l^{-1} (c_l\ r_l\ 1)^T = W'_l R_{rect} W_l^{-1} W_l M_l (x\ y\ z\ 1)^T = W'_l R_{rect} M_l (x\ y\ z\ 1)^T$

and


$\lambda_r (c'_r\ r'_r\ 1)^T = W'_r R_{rect} W_r^{-1} (c_r\ r_r\ 1)^T = W'_r R_{rect} W_r^{-1} W_r M_r (x\ y\ z\ 1)^T = W'_r R_{rect} R M_r (x\ y\ z\ 1)^T$

This leaves us with five unknowns (x, y, z, $\lambda_l$ and $\lambda_r$) for which we have six equations. As
stated above, the solution can be obtained using the least squares method.

If

$P_l = W'_l R_{rect} M_l$ and $P_r = W'_r R_{rect} R M_r$

then:

$P_l (x\ y\ z\ 1)^T - \lambda_l (c_l\ r_l\ 1)^T = \begin{pmatrix} P_{l11} & P_{l12} & P_{l13} & P_{l14} \\ P_{l21} & P_{l22} & P_{l23} & P_{l24} \\ P_{l31} & P_{l32} & P_{l33} & P_{l34} \end{pmatrix} (x\ y\ z\ 1)^T - \lambda_l (c_l\ r_l\ 1)^T = (0\ 0\ 0)^T$

$P_r (x\ y\ z\ 1)^T - \lambda_r (c_r\ r_r\ 1)^T = \begin{pmatrix} P_{r11} & P_{r12} & P_{r13} & P_{r14} \\ P_{r21} & P_{r22} & P_{r23} & P_{r24} \\ P_{r31} & P_{r32} & P_{r33} & P_{r34} \end{pmatrix} (x\ y\ z\ 1)^T - \lambda_r (c_r\ r_r\ 1)^T = (0\ 0\ 0)^T$

These two equation systems can be combined to obtain the final solution:

$\begin{pmatrix} P_{l11} & P_{l12} & P_{l13} & -c_l & 0 \\ P_{l21} & P_{l22} & P_{l23} & -r_l & 0 \\ P_{l31} & P_{l32} & P_{l33} & -1 & 0 \\ P_{r11} & P_{r12} & P_{r13} & 0 & -c_r \\ P_{r21} & P_{r22} & P_{r23} & 0 & -r_r \\ P_{r31} & P_{r32} & P_{r33} & 0 & -1 \end{pmatrix} \begin{pmatrix} x \\ y \\ z \\ \lambda_l \\ \lambda_r \end{pmatrix} = \begin{pmatrix} -P_{l14} \\ -P_{l24} \\ -P_{l34} \\ -P_{r14} \\ -P_{r24} \\ -P_{r34} \end{pmatrix}$

Page 31 of 56
A Correlation Based Stereo Vision System for Face Recognition Applications
April 2004
Daniel Bardsley (djb01u)
djb01u@cs.nott.ac.uk

The least squares solution to the linear equation system AX = B is given by:

X = ( AT A) −1 AT B

Solving this system for each of our corresponding image points allows each point to be
projected back into three dimensions. Whilst the linear system has to be solved for each
point, many of the values in the system remain constant for as long as a consistent calibration
is maintained, and hence the results can be calculated more efficiently if we do not
recalculate all values in the system for each point we are reconstructing.
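As a concrete illustration, the normal-equations solve X = (A^T A)^{-1} A^T B can be sketched as below. This is an illustrative stand-alone routine (the function name and the use of plain nested vectors are assumptions for the example); in practice a numerical library such as LAPACK would be used, as discussed in the implementation section.

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

// Solve the normal equations X = (A^T A)^-1 A^T B for an over-determined
// system A X = B via Gaussian elimination with partial pivoting.
// A is m x n (m >= n), B has m entries; returns the n-vector X.
std::vector<double> solveLeastSquares(const std::vector<std::vector<double>>& A,
                                      const std::vector<double>& B) {
    const size_t m = A.size(), n = A[0].size();
    // Form N = A^T A (n x n) and y = A^T B (n entries).
    std::vector<std::vector<double>> N(n, std::vector<double>(n, 0.0));
    std::vector<double> y(n, 0.0);
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j)
            for (size_t k = 0; k < m; ++k) N[i][j] += A[k][i] * A[k][j];
        for (size_t k = 0; k < m; ++k) y[i] += A[k][i] * B[k];
    }
    // Forward elimination on the augmented system [N | y].
    for (size_t col = 0; col < n; ++col) {
        size_t pivot = col;
        for (size_t r = col + 1; r < n; ++r)
            if (std::fabs(N[r][col]) > std::fabs(N[pivot][col])) pivot = r;
        std::swap(N[col], N[pivot]);
        std::swap(y[col], y[pivot]);
        for (size_t r = col + 1; r < n; ++r) {
            double f = N[r][col] / N[col][col];
            for (size_t c = col; c < n; ++c) N[r][c] -= f * N[col][c];
            y[r] -= f * y[col];
        }
    }
    // Back substitution.
    std::vector<double> X(n, 0.0);
    for (size_t r = n; r-- > 0;) {
        double s = y[r];
        for (size_t c = r + 1; c < n; ++c) s -= N[r][c] * X[c];
        X[r] = s / N[r][r];
    }
    return X;
}
```

For the triangulation system above, A would be the 6×5 matrix built from P_l, P_r and the rectified match co-ordinates, and the first three entries of the returned X give the reconstructed (x, y, z).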

9.1 Reconstruction Testing

In order to test the correctness of this reconstruction technique a scene with known 3D
parameters was created and fed as input into the system.

Figure 9: 3D Studio Max cube reconstruction. Test input (purple cube) and reconstructed
output (red spheres) shown from left, right, top and perspective views.

The system was also calibrated using generated images to ensure that errors in the
calibration stage are kept to a minimum and correlating points were manually matched to
ensure that any errors in the results were not due to erroneous point correlations. The virtual
stereo rig was identical to that shown in Figure 3. A cube was chosen as input for the test
reconstruction since only a small number of points need to be considered.

The test data was produced and rendered using Discreet’s 3D Studio Max [15]. In order to
test the validity of the results the calculated 3D co-ordinates were fully reconstructed and then
imported back into the original 3D Studio Max scene and compared with the original 3D
model. Figure 9 shows the results of importing the reconstructed data set into 3D Studio Max
with the cube representing the original data and the spheres showing the locations of the
reconstructed points.

As can be seen from the output, the system produces results which are almost identical to the
initial 3D points, with the centre of each sphere falling almost exactly on the vertices of the
original cube. Table 2 compares the reconstructed co-ordinates with the original cube
co-ordinates.

    Original Co-ordinates    Reconstructed Co-ordinates
    (5, 11, 20)              (4.362, 10.943, 20.157)
    (5, -14, 20)             (4.695, -13.404, 20.039)
    (-20, 10, 20)            (-20.061, 10.671, 20.014)
    (5, -14, -4)             (4.888, -12.994, -4.375)
    (-20, -14, -4)           (-19.351, -13.27, -4.68)
    (-20, 11, -4)            (-20.299, 10.597, -4.014)
    (5, 11, -4)              (4.566, 11.256, -4.056)

    Table 2: Reconstructed Co-ordinate Comparison

The reconstructed points all fall within approximately one unit of their original position. This
demonstrates a low level of error, which is most likely due to slight inaccuracies in manual
correspondence matching, lack of input resolution or image compression artefacts. Thus the
reconstruction algorithms exhibit a satisfactory level of accuracy.
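The error figures quoted above can be checked mechanically with a small helper (the function name is an assumption for this sketch) that measures the Euclidean distance between an original vertex and its reconstruction:

```cpp
#include <cassert>
#include <cmath>

// Euclidean distance between an original vertex and its
// reconstructed counterpart, each given as an (x, y, z) triple.
double reconstructionError(const double a[3], const double b[3]) {
    double dx = a[0] - b[0];
    double dy = a[1] - b[1];
    double dz = a[2] - b[2];
    return std::sqrt(dx * dx + dy * dy + dz * dz);
}
```

Applying this to each row of Table 2 gives per-vertex errors on the order of one unit or less.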

Page 33 of 56
A Correlation Based Stereo Vision System for Face Recognition Applications
April 2004
Daniel Bardsley (djb01u)
djb01u@cs.nott.ac.uk

10 Surface Estimation

Once the projective reconstruction procedures have been carried out, our results take the form
of an unorganised three-dimensional point cloud. In order to continue with the reconstruction
it is necessary to estimate properties of the original surface and hence decide which points
from the cloud should be interconnected. The choice of polygon from which we will attempt to
construct our surface is not of great importance, as long as the shape can correctly
represent the surface we are attempting to reconstruct. Since the majority of 3D rendering
algorithms are optimised for dealing with meshes constructed from triangles, and a number of
suitable surface construction algorithms generate their output as a list of connected triangles,
the triangle is the most suitable mesh construction primitive. The process of triangulation
“involves creating from the sample points a set of non-overlapping triangularly bounded
facets”, such that “the vertices of the triangles are the input sample points” [10]. Whilst there
are a number of algorithms readily available for triangulation, “the more popular algorithms are
the radial sweep method and the Watson algorithm which implement Delaunay triangulation”
[10]. A large number of surface construction algorithms are based on Delaunay
triangulations. These triangulations have been the subject of much research aimed at
optimising and constraining the original algorithm to achieve fast and accurate surface
representations. For the purpose of reconstructing our face surfaces, Bourke’s modification of
Delaunay’s method is suitable for our irregularly spaced dataset. A more detailed description
of Bourke’s triangulation technique is available in [10]; however, he summarises the work as
follows:

“The Delauney triangulation is closely related geometrically to the Direchlet tessellation also
known as the Voronoi or Theissen tessellations. These tessellations split the plane into a
number of polygonal regions called tiles. Each tile has one sample point in its interior called a
generating point. All other points inside the polygonal tile are closer to the generating point
than to any other. The Delauney triangulation is created by connecting all generating points
which share a common tile edge. Thus formed, the triangle edges are perpendicular bisectors
of the tile edges.”

At this stage it is impossible to tell which surface construction algorithm will prove most
suitable for recognition tasks. Since Bourke’s method provides visually satisfactory results, a
slightly modified implementation of his code has been used; however, as with all other
algorithms implemented within the system, additional algorithms can easily be added should
Bourke’s method not provide satisfactory recognition and reconstruction results at a later
stage.
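The key geometric predicate behind a Delaunay triangulation such as Bourke's is the circumcircle test: a candidate triangle belongs to the triangulation only if no other sample point lies inside its circumcircle. A minimal sketch of that test follows (an illustration, not Bourke's actual code):

```cpp
#include <cassert>

// Returns true if point (px, py) lies strictly inside the circumcircle
// of the triangle (x1,y1), (x2,y2), (x3,y3).
bool inCircumcircle(double px, double py,
                    double x1, double y1,
                    double x2, double y2,
                    double x3, double y3) {
    // Translate so the query point sits at the origin, then evaluate
    // the standard 3x3 in-circle determinant. Its sign depends on the
    // triangle's winding, so we normalise by the orientation below.
    double ax = x1 - px, ay = y1 - py;
    double bx = x2 - px, by = y2 - py;
    double cx = x3 - px, cy = y3 - py;
    double det =
        (ax * ax + ay * ay) * (bx * cy - cx * by) -
        (bx * bx + by * by) * (ax * cy - cx * ay) +
        (cx * cx + cy * cy) * (ax * by - bx * ay);
    // Orientation of the triangle: positive for counter-clockwise.
    double orient = (x2 - x1) * (y3 - y1) - (x3 - x1) * (y2 - y1);
    return (orient > 0) ? (det > 0) : (det < 0);
}
```

A Delaunay implementation repeatedly applies this predicate to reject or flip triangles whose circumcircles contain another input point.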

10.1 Surface Estimation Testing

In order to test the quality of the mesh construction, initial tests were carried out using the
same artificially rendered cube as was reconstructed in Figure 9. Once a surface has been
created it becomes trivial to fill the surface and calculate lighting effects to add a degree
of realism to the scene. Figure 10 shows the reconstruction and meshing process on the
cube data.

Figure 10: Reconstructed Cube output after mesh triangulation and surface construction.

Inspection of the output model shows the triangulation method provides accurate results
under simple conditions. Further testing was also carried out on a more complicated
reconstruction, under identical conditions except that the cube model was replaced by a head
model. The reconstruction of this more complex model is shown in Figure 11 and, for the
number of input points used, provides a good reconstruction of the original head model.

Figure 11: Original face rendering (left) and the reconstructed mesh (middle) along with a full
surface reconstruction (right).

The texture displayed on the original face image was chosen to make the point matching
stage as simple and error free as possible, with matching points being selected on the
chequered pattern vertices. Since a relatively low number of points were matched across the
input images the resultant model is of correspondingly low resolution; however, increased
resolution is simply a matter of matching a greater number of input points. Satisfactory
surface models are obtained using the described method, although Bourke’s implementation,
whilst it creates an accurate mesh, is somewhat simplistic. In order to create more accurate
models, more sophisticated meshing algorithms are required. Better surface reconstruction
algorithms would include features such as smoothing, to help reduce the effect of noisy
model data, or better interpolation between data points, to allow the creation of correctly
curved face surfaces. Finally, as an alternative to surface estimation, deforming a generic
head model would mostly eliminate the need for any surface reconstruction and should be
considered for future work.

10.2 Texture Mapping

The final stage of the reconstruction process involves applying a texture to the surface model.
Since we already have the 2D and 3D co-ordinates for each of the points in the
reconstruction, texture mapping simply involves extracting the 2D texture data from the input
images and applying it to the corresponding surface on the 3D model. More sophisticated
techniques that use textures from both input images are possible and would improve the final
output; however, at this stage taking the texture from a single input image and applying it to
the 3D model provides satisfactory results. Figure 12 shows the reconstruction from Figure
11 after an appropriate texture has been applied.
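As a sketch of this mapping step, converting a matched pixel position into normalised texture co-ordinates might look as follows (the function name and the bottom-left texture origin are assumptions for the example; the actual renderer's conventions may differ):

```cpp
#include <cassert>
#include <cmath>

// Convert a pixel position in the input image into normalised (u, v)
// texture co-ordinates. Texture space is assumed to run from (0,0) at
// the bottom-left to (1,1) at the top-right, whereas image rows are
// counted from the top, hence the flip in v.
void pixelToTexCoord(double col, double row,
                     int imageWidth, int imageHeight,
                     double& u, double& v) {
    u = col / (imageWidth - 1);
    v = 1.0 - row / (imageHeight - 1);
}
```

Each vertex of the triangulated surface is then assigned the (u, v) pair derived from its 2D match co-ordinates, and the renderer interpolates the texture across every facet.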

Figure 12: Texture mapped model reconstruction

Whilst the reconstruction process involves a number of steps and a large amount of
calculation, the volume of research carried out on the subject has led to the development of
many techniques which prove suitable for our application. The actual 3D point calculations
can be considered correct, since the perspective projection equations have been well
researched and successfully applied to a multitude of applications. The surface meshing
algorithms appear to be successful at this stage; however, only further testing will prove
whether they will be successful when used in more complex, non-simulated situations.
Finally, the approach to texture mapping is successful in producing a more realistic output and
complements the mesh in providing a realistic reconstruction.


11 Implementation

Major goals of the implementation of this stereo vision system include:

•   The development of a complete system capable of reconstructing the visible surface
    of an object given left and right images of a stereo pair and calibration data regarding
    the camera rig under which the stereo pair was captured.
• The system should be able to obtain the camera calibration data automatically from a
sequence of calibration images.
• An implicit assumption of the implementation should be that the specific algorithms
which have been selected may not be ideal and hence the application framework
should be developed in a manner which allows future expansion and fast integration
of improved algorithms.
• A degree of platform independence should be achieved.
•   The system should be geared towards producing output which provides a suitable
    basis for input into a pose-invariant face recognition system.

11.1 Design Choices

Initial choices of development language, target platform and development environment were
made easily. C++ is currently the most widely used application development language and as
such has a wide variety of libraries whose power can be harnessed during the development
phase. Many of the calculations which we are required to carry out have been implemented
many times over by a variety of authors. For example, the linear algebra and least squares
estimation which we require in the reconstruction stage need not be re-programmed since
libraries such as LAPACK [4] (amongst others), handle this sort of calculation efficiently.
Examples of other useful libraries include OpenCV [25], Intel’s Image Processing Library (IPL)
[24], and a number of others aimed squarely at developers creating computer vision
applications. A number of these libraries are available only in C++, and as such the choice of
C++ as our development language was heavily influenced by the existence of such libraries,
giving us more scope when deciding how various features should be implemented. In
addition, C++ is considered faster than alternatives such as Java under intensive processing
conditions. Furthermore, during the early development stages C# was still a largely unproven
technology and hence was not a primary consideration for this implementation. A more complete
description of the libraries which have been used can be found in the software libraries
section of this report.

Whilst one of the goals of the implementation is to provide a system which maximises the
degree of cross-platform compatibility, it should be noted that a relatively extensive GUI
interface is required to handle the display of large quantities of data in a graphical manner,
and hence high compatibility is difficult to achieve. However, in order to combat this difficulty
all algorithms which play an important part in the reconstruction (SSD cross-correlation,
ZMNCC, Delaunay triangulation, etc.) have been implemented in a library separate from the
user interface code. This achieves a software separation between vision code and GUI code,
and hence whilst the interface is application and platform specific the actual vision algorithms
retain a degree of cross-compatibility and re-usability. The current implementation contains
all the vision code in a library named “VisionLib” and compiles the code to a DLL, enabling
other applications to use the latest version of the library without storing multiple versions of
the compiled code on a single computer. The main application is called “FaceScanner” and is
linked at compile time to the VisionLib library.

11.2 Application Architecture

The initial implementation of the system is compiled for the Windows platform. Since
Microsoft’s Visual Studio is the IDE of choice for programming C++ under Windows, this
software was selected for development, and hence it was logical to utilise the visual
development features of the environment. To this end the application makes extensive use of
Microsoft Foundation Class (MFC) functionality. Furthermore, the Microsoft MDI Document /
View application architecture has been adhered to. In essence this architecture divides an
application into a document part (derived from the Document super-class) which contains all
user data pertaining to the contents of the work space. In our case this is the currently
matched points, current matching algorithm selection data, calibration information and all the
rest of the data associated with the reconstruction currently being worked on. The document
section of the application has a close association with the vision library, since the data stored
in the document must be manipulated by the algorithms available in the VisionLib DLL. The
second part of the architecture contains a series of “views” on the document which display
and optionally edit parts of the data stored in the document. The views are derived from the
FormView super-class. Each view is associated with a document and defines the way in
which the user is presented with the information contained within the document.
Examples of views in our application are the raw data viewer, shown here in Figure 13, the
3D data viewer and the input image view amongst others. An advantage of this approach is
that as functionality is added to the application, new views can be created in order to allow the
user to interact with the new functionality and data as easily as possible. Additionally more
views can be added at a later date to allow new ways for data to be entered into the program
and new ways for the data to be examined. For example, it would be trivial to add a view
which enabled the user to select point matches by hand without changing any of the code
already in place. Since the new view would simply make changes to the data structures
already available in the document, all previous views would already be equipped to deal with
displaying the data in a manner appropriate to that view. This advantage is crucial in fulfilling
the implementation requirement that we make no assumptions about the specific algorithms
which will be in place in the final application.

Figure 13: The raw data view displays actual pixel co-ordinates of the point matches,
reconstructed points, normalised model co-ordinates and raw calibration data.

Data structure design is one of the most important aspects of any application, but plays a
particularly important role in graphically intensive applications. Much of the reason for this lies
in the complexity of many graphics objects. Not only does the data need to be stored, but it
also needs to be processed quickly, leading to the requirement that all data structures should
be optimised for fast processing. This can lead to some difficulties in designing appropriate
data structures, as many require constant redesign to cater for additional algorithms’
processing requirements. Apart from fundamental data objects such as images (which are
handled by the IplImage data structure, a part of the IPL image library), custom data
structures were implemented for the majority of objects. Readers should refer directly to the
code to observe implementation-specific data structures. It should be noted, however, that a
number of libraries exist aimed at providing many of the data structures which have been
implemented. To this end it probably would have been more efficient to use some of these
data structures “out of the box” rather than relying on custom implementations which required
constant rewrites as the application grew. Despite this possible oversight, the data structures
in place adequately represent the user data and perform processing in a timely manner,
suggesting that they are indeed suitable for the task at hand.

The main application structure is shown in the simplified UML class diagram of Figure 14.
The FaceScannerApp class is the main class of the application and brings together all other
elements of the program including the document and the available views. The document

Page 39 of 56
A Correlation Based Stereo Vision System for Face Recognition Applications
April 2004
Daniel Bardsley (djb01u)
djb01u@cs.nott.ac.uk

object is perhaps the most important in this structure since this is where all data on the
current reconstruction is held. In general the document holds all the objects defined in the
VisionLib library relating to the reconstruction. Objects representing the calibration, the image
pre-processing algorithm, the point input algorithm, the point match algorithm, the surface
reconstruction algorithm, a list of applied constrains and the input image data are all stored in
the document with data regarding actual point matches and current 3D points etc. All the
data types for these objects are defined in the VisionLib library and therefore are reusable in
alternative applications. Furthermore, through support provided by the MFC document / view
architecture it becomes simple to save the state of a reconstruction by serialising the
document object and thus saving the state of the currently active algorithms for reloading in a
future session. Also since we are using the multiple document / view version of the
architecture it is possible to work on multiple documents (and hence multiple reconstructions)
in the same workspace, thus allowing direct comparison between different algorithms and
datasets. A final advantage of grouping the majority of the reconstruction specific data into
the same document object is that this constrains interaction with the external VisionLib library
to a single object and thus reduces the complexity of the interaction between the two separate
components of the application.

Figure 14: Simplified UML diagram of the user interface / MFC portion of the FaceScanner
Application, showing FaceScannerApp, MainFrame, the document class (Doc) and its views
(ViewAlgs, ViewImages, ViewData, ViewCalib, View3D and ViewRender3D). Some fields and
methods have been omitted for conciseness.


The inclusion of a variety of views on the document object allows the data contained within
the document to be presented to the user in a manageable fashion. Each view is tailored to
displaying portions of the document data in a unique manner, and optionally provides the user
with ways to interact with that data. In essence the above completely defines the main
application structure, i.e. a document and some views. User interaction with the views
triggers application events which interface with the vision library code to manipulate the
document data and thus guide the reconstruction process.

Figure 15 shows the more complex class interactions found within the Vision library
(VisionLib). As well as containing all code relating to specific vision algorithms, VisionLib
contains the data structures required by each of the algorithms.

Figure 15: Simplified UML diagram of VisionLib, the library containing all the computer vision
related code within the project, including the Stereo base class and its Input, Match and
Constraint families, along with the PrePro, Reconstruct, Tools, ImageData and MatchingPoint
classes.


11.3 Data Structures and Algorithms

The major data structures defined within VisionLib are as follows:

•   CvCalibFilter: This is the calibration object. VisionLib utilises this DirectX
    DirectShow filter provided by OpenCV to obtain stereo rig calibration data for use
    throughout the library.
• ImageData: This class holds all the image data required by the reconstruction.
• MatchingPoint: All of the data regarding point correspondence matches is stored in
this object. This includes x and y co-ordinates on both the left and the right images,
whether the point is deemed valid and the calculated strength of the match. This
object also contains methods for storing the calculated 3D position of the matched
point.
• Reconstruct: This object contains data structures and methods for surface
generation, including a linked-list of joined 3D points specifying the triangular surface
mesh.
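As an indication of the shape of these records, a much simplified MatchingPoint might look as follows (a hypothetical sketch only: the real class stores OpenCV CvPoint values and carries further methods, as listed above):

```cpp
#include <cassert>

// Hypothetical, simplified sketch of the MatchingPoint record: a pair
// of 2D image co-ordinates, match validity and strength, plus the 3D
// point recovered by the reconstruction stage.
struct Point2D { double x, y; };
struct Point3D { double x, y, z; };

struct MatchingPoint {
    Point2D pt1{};          // position in the left image
    Point2D pt2{};          // position in the right image
    bool    valid = true;   // cleared by the constraint objects
    float   strength = 0;   // correlation score from the match stage
    Point3D pos3d{};        // filled in by the reconstruction stage

    void set3D(double x, double y, double z) { pos3d = {x, y, z}; }
};
```

Keeping the 2D match, its score and the reconstructed 3D position in one record is what lets the raw data view, the constraint objects and the 3D renderer all operate on the same list of points.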

The majority of the remaining classes interact with these four data structures to progress
through the different stages of the reconstruction. It is also the data in these structures that
the views in the main application interface are designed to interact with and display via the
application document object.

In order to support the fast interchange and comparison of algorithms, the VisionLib library
exhibits a high degree of polymorphism. The set of objects relating to stereo reconstruction is
derived from the Stereo class, which defines a set of virtual methods that derived objects
must implement. This allows for a consistent interface to each of the different algorithms.
The three sets of algorithms derived from the Stereo class are Input, Match and Constraint.
Input contains subclasses for handling input point selection, i.e. which initial points we will
attempt to correlate; Match contains subclasses for finding corresponding image points; and
Constraint handles constraining these matches. Each of these three classes is further
subclassed to provide the actual functionality. For example, the Match class is subclassed to
provide implementations of the actual matching algorithms; currently implemented here are
the SSD and ZMNCC algorithms investigated in the Correlation section of this report.
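The polymorphic arrangement described above can be sketched as follows (a simplified illustration of the idea, not the actual VisionLib class bodies):

```cpp
#include <cassert>
#include <string>

// Sketch of the VisionLib hierarchy: Stereo defines the interface every
// reconstruction-stage algorithm must implement, and concrete
// algorithms such as a ZMNCC matcher plug in underneath it.
class Stereo {
public:
    virtual ~Stereo() = default;
    virtual std::string name() const = 0;
    virtual void update() = 0;       // run the algorithm on current data
    virtual void showSettings() {}   // optional settings dialog hook
};

class Match : public Stereo {};      // base for correspondence matchers

class ZmnccMatch : public Match {
public:
    std::string name() const override { return "ZMNCC"; }
    void update() override { ran = true; }  // matching would happen here
    bool ran = false;
};

// The document only ever sees the Stereo interface, so swapping in a
// different algorithm never touches the calling (GUI) code.
void runStage(Stereo& alg) { alg.update(); }
```

Because the document object holds only Stereo references, a new matcher is added by deriving from Match and overriding the virtual methods; no GUI code changes are required.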

Some of the more important aspects of some specific algorithm implementations are
described below:


Input Objects:
•   CvGoodTrackingFeatures: This algorithm is based on a method within the
    OpenCV library. It takes an image which has had a binary threshold filter applied
    to it and uses a modification of the Harris detector to find feature points with a
    good chance of being matched. The binary threshold filter and point detector can
    be applied at a number of different thresholds simultaneously to provide a point
    set which is evenly distributed over the target object.
• PatternGrid: Feature points are chosen by overlaying the input image with
regularly spaced input points which form a rectangular grid. This allows every
pixel in the input image to be selected, either for reconstruction purposes or to
attempt to create a dense disparity map. This is only useful when we need
regular input point spacing, since it makes no guarantees that the points will
make good matches.
• Manual: Input points can be manually entered via a text file in the form of 2D co-
ordinates.

Match Objects:
• SSD and ZMNCC: Both these correlation algorithms behave as described in the
appropriate section of this report except they have both been modified for
improved performance on colour images. The algorithms can optionally take into
consideration information from all three colour channels to help differentiate
between closely contested best matches.
• Manual: Matching points can be entered via a text file containing a list of 2D co-
ordinates.
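For reference, the ZMNCC score for a single pair of greyscale intensity windows can be sketched as below; the colour extension mentioned above simply evaluates this per channel (an illustrative routine, not the VisionLib implementation):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Zero-mean normalised cross-correlation between two equally sized
// intensity windows, flattened to vectors. Returns a score in [-1, 1];
// 1 means the windows are identical up to brightness and contrast.
double zmncc(const std::vector<double>& a, const std::vector<double>& b) {
    const size_t n = a.size();
    double meanA = 0, meanB = 0;
    for (size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= n; meanB /= n;
    double num = 0, denA = 0, denB = 0;
    for (size_t i = 0; i < n; ++i) {
        double da = a[i] - meanA, db = b[i] - meanB;
        num  += da * db;   // cross term
        denA += da * da;   // left window energy
        denB += db * db;   // right window energy
    }
    return num / std::sqrt(denA * denB);
}
```

Subtracting the window means and normalising by the energies is what gives ZMNCC its robustness to brightness and contrast differences between the two cameras.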

Constraint Objects:
• Similarity: During the correlation phase a match strength is calculated for each
point candidate based on the similarity of the point and the surrounding area.
When the similarity constraint is applied all points with a match strength below a
given value are marked as invalid.
• Uniqueness: Each point in a dataset is tested for uniqueness with all other co-
ordinates in the dataset.
•   Statistical: Certain assumptions can be made about the reconstructed data.
    Assuming a normal distribution of the 3D points, we can eliminate points that fall
    more than a given number of standard deviations from the mean and hence
    remove some points that may have been matched incorrectly.
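The statistical constraint can be sketched as follows: the mean and standard deviation of one co-ordinate of the reconstructed point set are computed, and anything further than k standard deviations from the mean is flagged invalid (the function name and the parameter k are assumptions for the example):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Flag values lying more than k standard deviations from the mean as
// outliers; returns a parallel vector of validity flags.
std::vector<bool> statisticalConstraint(const std::vector<double>& v, double k) {
    double mean = 0;
    for (double x : v) mean += x;
    mean /= v.size();
    double var = 0;
    for (double x : v) var += (x - mean) * (x - mean);
    double sd = std::sqrt(var / v.size());
    std::vector<bool> valid;
    for (double x : v) valid.push_back(std::fabs(x - mean) <= k * sd);
    return valid;
}
```

In practice this would be run over each of the x, y and z co-ordinates of the reconstructed point cloud, marking the corresponding MatchingPoint records invalid rather than returning flags.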

A number of other important classes exist in the library which are not specific to stereo
matching and hence are not derived from the Stereo object. These are Tools, PrePro and
Reconstruct. Tools contains miscellaneous methods which do not fall into the other
categories but are useful vision algorithms nonetheless. For example, in some cases it may
be useful to find a face within a given image, and hence the Tools class contains methods for
performing such tasks. The face finding algorithm is based on a Haar feature cascade [54]
and is a direct implementation of functionality provided by the OpenCV library. The PrePro
class contains functionality for 2D image manipulation which may be useful in stages prior to
matching. Functionality such as histogram matching is provided, which has potential uses in
illumination-invariant matching across input images. Finally, the Reconstruct object forms the
basis for a set of reconstruction algorithms. This set of objects takes 3D point cloud data as
input and returns a predicted surface. Bourke’s modified version of Delaunay triangulation is
implemented here.

11.4 Implementation Results

The current implementation of the face scanner vision system meets most of the
requirements specified as goals at the beginning of the implementation. We have indeed
implemented a system that is “capable of reconstructing the visible surface of an object given
left and right images of a stereo pair and calibration data.” We have not implemented a
general vision system, instead focussing on the reconstruction of face objects. Furthermore,
the implementation is capable of obtaining accurate calibration data from a sequence of
images containing an appropriate calibration pattern. The application is also successful in
separating vision code from GUI code to ensure maximum re-usability of promising vision
algorithms. The structure of the vision library code also satisfies the goal that the system
“allows future expansion and fast integration of improved algorithms.” This is achieved
through the implementation of a polymorphic class structure within the library, which enables
every component of a vision process to interact with every other component without regard
for the specific algorithms in use. Finally, the separation of the vision code from the GUI code
has the additional benefit of increasing the ease with which the vision code could be ported to
another platform, since only the GUI code is heavily platform dependent.

With regard to the goal that the application should provide output suitable as the basis of a
pose-invariant face recognition system, it is unclear if the current application meets these
requirements. Since we currently have no frontal face recognition system available for testing
purposes, the requirements for “good” models for recognition are unclear and hence our
application cannot be tested for suitability. It is likely however that a number of additional
features would be required. For example when a face is viewed in a none frontal pose and

Page 44 of 56
A Correlation Based Stereo Vision System for Face Recognition Applications
April 2004
Daniel Bardsley (djb01u)
djb01u@cs.nott.ac.uk

then rotated to a frontal pose prior to recognition it is possible non-visible surfaces are now
visible.

Figure 16: FaceScanner application screenshot

These surfaces must be estimated to allow proper recognition to take place. The symmetric properties of the face suggest that we could estimate the missing surfaces from the data already available. At this stage, algorithms aimed at solving this set of missing-data surface reconstruction problems are a target for future work. The current implementation of the application does, however, show some promise in this area, since the calibration, reconstruction and application framework currently in place have proved to be both correct and accurate. The point correlation algorithms, whilst correct, are relatively basic, and more advanced correlation algorithms need to be implemented. The application is, however, relatively successful in meeting the demands set by the implementation goals. Figure 16 shows a screenshot of the application running with a reconstruction in progress.

12 Software Libraries

A number of libraries and APIs were found to be useful during the development of the stereo vision system. The libraries used are listed below.

Intel’s Image Processing Library (v2.5)

This library provides low-level functionality for processing bitmaps, JPEGs and other image formats. Whilst the stereo vision system does not use much IPL functionality directly, libraries such as OpenCV rely heavily on the functionality it provides.

Intel’s OpenCV (v3.1 Beta)

This open source computer vision library contains a large collection of functions for computer vision tasks, ranging from camera calibration and disparity estimation to the computation of optical flow. The library has since been superseded by the Intel Performance Primitives library; however, much of the functionality is reportedly identical to that of OpenCV. Much of the OpenCV functionality is itself built on lower-level functions provided by the Intel Performance Primitives library.

Intel’s Math Kernel Library (v6.1.009)

Our application makes use of the linear algebra sections of this maths library. A linear algebra / least squares technique is used to project correlated image points back into 3D space. The linear algebra routines in MKL are based on those implemented in LAPACK.

Microsoft DirectX SDK (v9.0)

Several of the image capture and camera control routines are based on the DirectX SDK. The calibration process is also implemented as a DirectX filter.

Microsoft Foundation Classes

The Windows interface is programmed to take full advantage of the available MFC resources, with the current implementation using the Multiple Document Interface and Document/View architecture to allow simultaneous, complex views of the large datasets handled throughout the course of a reconstruction.

OpenGL

All 3D views of our data are rendered using the OpenGL library. OpenGL was selected over alternatives such as Direct3D because of its programming simplicity and its wide support from both application programmers and hardware designers. Furthermore, OpenGL is much more platform independent than Direct3D.

A number of applications were utilized in the development of the stereo vision system.

Microsoft Visual Studio 6.0

C++ application development was carried out exclusively in this industry-standard development tool, selected primarily for its MFC support and visual application development features. C++ was chosen as the development language for its speed, versatility and library support, which exceed those achievable with interpreted languages such as Java.

Mathworks Matlab 12

The more mathematically oriented algorithms in the system were tested for correctness and prototyped using this matrix evaluation application.

13 Results
The system was tested with a variety of data under a number of conditions. Most of the system tests were carried out using synthesized images, since this makes it easier to eliminate unwanted input features such as noise or illumination variations. Furthermore, with synthesized images a correct version of the model we are trying to reconstruct already exists, giving us ground-truth data against which to compare our results. Much of the output from the system has been included in earlier sections in order to demonstrate the correctness of the various sub-systems. The project meets the majority of the goals we set out to achieve. Significant research and development has been carried out into stereo camera calibration, the correspondence problem, 3D projective reconstruction and surface estimation. Further to this, an implementation of a vision system aimed at tackling the problems brought about by the reconstruction process has been developed, with the results obtained from the program demonstrating an acceptable level of accuracy.

Practical evidence suggests that the calibration process performs correctly. The calibration section of this report details results from a sequence of synthesized images of a known rig calibration and demonstrates that the results obtained represent the properties of the actual rig correctly. Testing of the calibration procedures on both real image sequences and live video has also produced accurate results. Furthermore, testing of the system with a number of varying camera rigs demonstrated that this implementation of the calibration routines works for the majority of general stereo rig configurations.
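For reference, intrinsic calibration results of this kind are typically checked by projecting known 3D points through the recovered pinhole model and comparing against the observed pixel positions. The sketch below shows only the projection step; the parameter values used are invented for illustration and do not come from the rig described in this report.

```cpp
#include <array>
#include <cassert>

// Intrinsic parameters recovered by calibration: focal lengths (fx, fy)
// in pixels and the principal point (cx, cy).
struct Intrinsics { double fx, fy, cx, cy; };

// Pinhole projection of a camera-frame point (X, Y, Z), Z > 0, to pixel
// coordinates: u = fx * X / Z + cx, v = fy * Y / Z + cy.
std::array<double, 2> project(const Intrinsics& k, double X, double Y, double Z) {
    return { k.fx * X / Z + k.cx, k.fy * Y / Z + k.cy };
}
```

Comparing such projected positions against detected calibration-pattern corners gives the reprojection error commonly used to judge a calibration.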

Perhaps the most error-prone area of the reconstruction process at present is the correspondence matching phase. The correspondence problem is widely considered the most difficult part of reconstruction, and this is borne out by our implementation. As demonstrated in the correspondence section of this report, both the SSD and ZMNCC algorithms perform well and produce good disparity maps; however, these simple pixel-wise intensity-based algorithms prove too lightweight to perform well under general reconstruction conditions. Simple intensity-based algorithms are unlikely to yield a quality solution to this particular correspondence problem, chiefly because correct point matches are often too similar to incorrect candidate matches for intensity-based methods to differentiate between them. The situation is complicated further by noise during image capture and by varying light levels between the images of a stereo pair. These image features cause major errors in the correspondence phase which propagate to later stages. The addition of numerous matching constraints improves matters at the cost of reducing the degree of automation present in the system; however, they do not provide a perfect solution and erroneous points are still matched. The system is unlikely to be improved much further through the addition of more constraints, and as such the development of more advanced matching algorithms using non-intensity-based methods is a primary goal for future work. For the currently implemented correlation techniques to be effective we would need to match only a small number of highly salient input points. This would increase the likelihood of obtaining an accurate result set at the expense of having fewer points to work with, and hence a less accurate resultant surface.
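For concreteness, the standard form of the ZMNCC score used in area-based matching is sketched below. This is the textbook formulation rather than necessarily the exact code in the application; subtracting each window’s mean is what gives the score its tolerance to a constant brightness offset between the two images of the pair.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Zero-mean normalised cross-correlation between two equally sized pixel
// windows. Scores lie in [-1, 1]; 1 indicates a perfect match. Subtracting
// each window's mean makes the score invariant to a constant brightness
// offset between the left and right images.
double zmncc(const std::vector<double>& a, const std::vector<double>& b) {
    const std::size_t n = a.size();
    if (n == 0 || b.size() != n) return 0.0;

    double meanA = 0.0, meanB = 0.0;
    for (std::size_t i = 0; i < n; ++i) { meanA += a[i]; meanB += b[i]; }
    meanA /= n; meanB /= n;

    double num = 0.0, denA = 0.0, denB = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double da = a[i] - meanA, db = b[i] - meanB;
        num  += da * db;
        denA += da * da;
        denB += db * db;
    }
    const double den = std::sqrt(denA * denB);
    return den > 0.0 ? num / den : 0.0; // flat windows carry no information
}
```

In a full matcher this score is evaluated for every candidate window along the epipolar line and the highest-scoring position is taken as the match, which is exactly where the ambiguity between similar candidates described above arises.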

Once a set of matches has been found and constrained we can commence with the actual reconstruction. The calculations behind the 3D projection are well established, and given the appropriate projection equations, reconstruction to a 3D point cloud is relatively trivial. The reconstruction section presents a cube and a face model reconstruction which serve to demonstrate the accuracy of the technique. Surface reconstruction from the point cloud provides adequate results using Bourke’s Delaunay implementation. Ideally the meshing algorithm would take some account of potential errors in previous stages; at present it does not, and hence the surface reconstruction stage does nothing to “smooth over” errors from the correspondence stage. To this end a more sophisticated algorithm could be implemented which attempts to create a smooth surface, possibly using techniques such as Bézier curves. An implementation of the marching cubes algorithm would also provide an interesting comparison, potentially producing a surface with more desirable properties than the mesh currently produced. Furthermore, if the system were to be used as a recognition subsystem it would be essential to consider algorithms for hidden surface reconstruction, to enable rendering of surfaces of the face not initially visible in the input images. As an alternative to estimating the hidden surface, it may prove useful to implement a system that utilizes a generic head model to aid reconstruction. This may also prove a viable way of increasing the effectiveness of the currently implemented intensity-based correlation algorithms, since fewer points would need to be matched given the volume of data already available to the system. The current system is already capable of selecting only input points with a high chance of being matched (using the GoodTrackingFeatures input algorithm) and as such the framework is already suitable for the addition of a generic head model.
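The application performs this projection by least squares through MKL; as an illustration of the underlying geometry, the sketch below uses the simpler closed-form mid-point method instead: each matched pixel back-projects to a ray from its camera centre, and the reconstructed point is taken as the midpoint of the shortest segment joining the two rays. With noise-free matches the rays intersect and the midpoint is the exact 3D point.

```cpp
#include <cassert>
#include <cmath>

struct Vec3 { double x, y, z; };

Vec3 add(Vec3 a, Vec3 b)      { return {a.x + b.x, a.y + b.y, a.z + b.z}; }
Vec3 sub(Vec3 a, Vec3 b)      { return {a.x - b.x, a.y - b.y, a.z - b.z}; }
Vec3 scale(Vec3 v, double s)  { return {v.x * s, v.y * s, v.z * s}; }
double dot(Vec3 a, Vec3 b)    { return a.x * b.x + a.y * b.y + a.z * b.z; }

// Given each camera's centre and the back-projected ray direction through a
// matched pixel, return the midpoint of the shortest segment joining the
// rays p(s) = c1 + s*d1 and q(t) = c2 + t*d2 (closed-form least squares in
// the two scalar parameters s and t).
Vec3 triangulateMidpoint(Vec3 c1, Vec3 d1, Vec3 c2, Vec3 d2) {
    const Vec3 w0 = sub(c1, c2);
    const double a = dot(d1, d1), b = dot(d1, d2), c = dot(d2, d2);
    const double d = dot(d1, w0), e = dot(d2, w0);
    const double denom = a * c - b * b; // zero only for parallel rays
    const double s = (b * e - c * d) / denom;
    const double t = (a * e - b * d) / denom;
    const Vec3 p = add(c1, scale(d1, s));
    const Vec3 q = add(c2, scale(d2, t));
    return scale(add(p, q), 0.5);
}
```

With noisy matches the two rays no longer intersect, and the length of the joining segment itself gives a useful per-point error measure.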

Figure 17 shows the results of a fully automated reconstruction after a number of thresholds were set and calibration data acquired. The images used were again synthesized, since at this point it is difficult to obtain accurate reconstructions from real imagery due to the correspondence problems described above. The images were also created in such a manner that the correspondence algorithms would find matches easier to make.

Figure 17: Fully automatic reconstruction of a synthesized face from stereo images. White
dots on the 3d model show initial point match positions.

It is evident from the output produced that the correspondence algorithms struggled even under these constrained conditions and still produced numerous incorrect matches. Many of these were eliminated by the application of constraints; however, the correlation algorithms are simply not yet accurate enough to perform on real-world data.

Despite some correlation accuracy problems the system performs well throughout. It should be noted, however, that the problems in the correspondence stage of the reconstruction appear to be due to properties of the matching algorithms involved rather than to fundamental problems with the system. Furthermore, we have been successful in creating an application framework in which new and more efficient algorithms can be implemented and integrated with the system without major problems. This means that, despite the under-performance of the current correspondence algorithms, promising new techniques such as wavelet decomposition and matching can be implemented and integrated into the system so that their performance can be analyzed. Thus, although errors remain within certain areas of the system, these errors can be observed and algorithms with lower error rates introduced into the system with ease.

The implementation of the FaceScanner application and the development of the VisionLib library have produced a successful architecture for investigating various vision algorithms and reconstruction techniques. Furthermore, the interface through which the user interacts with the system is of potential commercial quality, allowing simple guidance of the reconstruction processes supported by intuitive data representation. The reusable nature of each of the system components allows future development within the current application framework to improve on current performance.

14 Conclusions and Future Work

The majority of the goals specified at the beginning of this report have been met. Successful research has gone into each stage of the reconstruction process and the system is geared towards working with a set of CCTV cameras. At this stage autonomous reconstruction from real imagery has not been achieved, due to a partially inadequate solution to the correspondence problem; however, the framework is in place and capable of supporting future reconstructions given more powerful matching algorithms. The system has, however, proved a number of techniques to be correct and well suited to the task of facial reconstruction. Testing carried out on synthesized input yielded accurate results in the areas of the system where they were expected. A working implementation of a vision system capable of stereo calibration and reconstruction has been developed and adheres to the design goals specified in the implementation section.

With regard to the usefulness of the system output as input to a recognition subsystem, the results are inconclusive. Additional features would certainly have to be implemented and point matching improved to ensure we could construct an accurate face model; however, the basis for such a system is in place. The addition of hidden surface reconstruction, possibly through the use of a generic head model, would be essential to ensure we can recreate recognizable face models. The development of more accurate point matching algorithms should probably be the focus of future work, since this is the area where the application currently struggles to perform. This could be further aided by the use of different algorithms in the feature point detection stage, despite the current algorithm performing reasonably well. Finally, improvements to surface estimation, with the addition of mesh smoothing to eliminate errors from earlier stages, would likely produce a system very suitable for use in a pose-invariant face recognition system, although the system is not currently at that stage.
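One candidate for the mesh smoothing suggested above is simple Laplacian smoothing, sketched below: each vertex moves a fraction of the way towards the centroid of its neighbours, damping isolated spikes caused by mismatched points. This is an illustration of the idea rather than code from the application.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Vtx { double x, y, z; };

// One pass of Laplacian smoothing: every vertex moves a fraction `lambda`
// of the way towards the centroid of its neighbours. Vertices with no
// listed neighbours (e.g. pinned boundary points) are left untouched.
std::vector<Vtx> smoothOnce(const std::vector<Vtx>& v,
                            const std::vector<std::vector<std::size_t>>& nbr,
                            double lambda) {
    std::vector<Vtx> out = v;
    for (std::size_t i = 0; i < v.size(); ++i) {
        if (nbr[i].empty()) continue;
        double cx = 0, cy = 0, cz = 0;
        for (std::size_t j : nbr[i]) { cx += v[j].x; cy += v[j].y; cz += v[j].z; }
        const double n = static_cast<double>(nbr[i].size());
        out[i].x = v[i].x + lambda * (cx / n - v[i].x);
        out[i].y = v[i].y + lambda * (cy / n - v[i].y);
        out[i].z = v[i].z + lambda * (cz / n - v[i].z);
    }
    return out;
}
```

Repeated passes smooth more aggressively at the cost of shrinking genuine surface detail, so the number of iterations and `lambda` would need tuning against the depth resolution required for recognition.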

The stereo reconstruction problem is one of constrained optimization. The existence of numerous stages in the system leads to the propagation of estimation errors throughout. By increasing accuracy wherever possible and constraining each part of the system to eliminate most errors, we have produced a system which is, to a degree, capable of face surface reconstruction. Despite containing stages which under some conditions fail to perform, the system produces accurate results in general. The majority of the initial design goals have been satisfied, and with the implementation of some additional algorithms the system could be made to fulfil all aims and goals and find application in face recognition utilities.

15 Bibliography

1. A Novel Technique For Face Recognition Using Range Imaging, http://lcv.stat.fsu.edu/publications/paperfiles/pcarecog.pdf, Last Accessed: 09/04/04

2. Adjouadi and F. Candocia, A Similarity Measure for Stereo Feature Matching. IEEE
Transactions on Image Processing, 1997. 6(10).

3. Akamatsu, S., H.F. T. Sasaki, and Y. Suenaga, A Robust Face Identification Scheme
- KL expansion of an invariant feature space. SPIE Proceedings of Intelligent Robots
and Computer Vision X: Algorithms and Techniques, 1991. 1607: p. 71-84.

4. Anderson, E., Z. Bai, C. Bischof, S. Blackford, J. Demmel, J. Dongarra, J.D. Croz, A. Greenbaum, S. Hammarling, A. McKenney, and D. Sorensen, LAPACK Users' Guide. Society for Industrial and Applied Mathematics, 1999.

5. Bacakoglu, H. and M.S. Kamel, A three-step camera calibration method. IEEE Trans. Instrumentation and Measurement, 1997. 46: p. 1165-1172.

6. Balasuriya, L.S. and D.N.D. Kodikara, Frontal View Human Face Detection and
Recognition. 2001.

7. Barber, C.B., D.P. Dobkin, and H. Huhdanpaa, The Quickhull Algorithm for Convex
Hulls. 1996.

8. Beumier, C. and M. Acheroy. Automatic Face Authentication from 3D surfaces. in British Machine Vision Conference. 2001. Royal Military Academy, Signal & Image Centre (c/o ELEC), Avenue de la Renaissance 30, B1000 Brussels, Belgium.

9. Bhatti, A., S. Nahavandi, and H. Zheng. Image Matching using TI Multi-Wavelet Transform. in VIIth Digital Image Computing: Techniques and Applications. 2003. Sydney.

10. Bourke, P., An Algorithm for Interpolating Irregularly-Spaced Data with Applications in
Terrain Modelling, 1989, http://astronomy.swin.edu.au/~pbourke/terrain/triangulate/,
Last Accessed: 07/04/2004

11. Bouvier, D.J., Double-Time Cubes: A Fast 3D Surface Construction Algorithm for Volume Visualization. 1994.

12. Carr, J.C., R.K. Beatson, J.B. Cherrie, T.J. Mitchell, W.R. Fright, B.C. McCallum, and
T.R. Evans, Reconstruction and Representation of 3D Objects with Radial Basis
Functions, Applied Research Associates, University of Canterbury NZ.

13. Chan, S.O.-Y., Y.-P. Wong, and J.K. Daniel, Dense Stereo Correspondence Based
on Recursive Adaptive Size Multi-Windowing. 2000.

14. Cooper, O., N. Campbell, and D. Gibson, Automated Meshing of Sparse 3D Point Clouds, University of Bristol.

15. Discreet, Homepage for the makers of 3D Studio Max, 2004, http://www.discreet.com/, Last Accessed: 15/04/04

16. Elagin, E., J. Steffens, and H. Neven. Automatic Pose Estimation System For Human
Faces Based on Bunch Graph Matching Technology. in Proceedings of the Third
International Conference on Automatic Face and Gesture Recognition. 1998. Nara,
Japan.

17. Faugeras, O. and Q.-T. Luong, The Geometry of Multiple Images. MIT Press, 2001.

18. Fieguth, P.W. and T.J. Moyung, Incremental Shape Reconstruction Using Stereo
Image Sequences. Department of Systems Design Engineering, University of
Waterloo, Ontario, Canada.

19. Forsyth, D. and J. Ponce, Computer Vision: A Modern Approach. 2003: Prentice Hall.

20. Fraser, C. Automated Vision Metrology: A Mature Technology For Industrial Inspection and Engineering Surveys. in 6th South East Asian Surveyors Congress, Fremantle. 1999. Department of Geomatics, University of Melbourne, Western Australia.

21. Galo, M. and C.L. Tozzi, Feature Based Matching: A Sequential approach based on
relaxation labeling and relative orientation. 1997.

22. Hall, D., B. Leibe, and B. Schiele. Saliency of Interest Points under Scale Changes. in British Machine Vision Conference 2002. 2002.

23. Huang, J., V. Blanz, and B. Heisele, Face Recognition with Support Vector Machines
and 3D Head Models. Center for Biological and Computer Learning, M.I.T,
Cambridge, MA, USA and Computer Graphics Research Group, University of
Freiburg, Freiburg, Germany.

24. Intel, Image Processing Library, 2000, http://developer.intel.com/software/products/perflib/ijl/, Last Accessed: 04-2004

25. Intel, Open Source Computer Vision Library, 2000, http://www.intel.com/research/mrl/research/opencv/, Last Accessed: 04-2004

26. Kallmann, M., H. Bieri, and D. Thalmann, Fully Dynamic Constrained Delaunay
Triangulations. 2002.

27. Keller, M.G., Matching Algorithms and Feature Match Quality Measures For Model Based Object Recognition with Applications to Automatic Target Recognition, in Courant Institute of Mathematical Sciences. 1999, New York University.

28. Kim, J., V. Kolmogorov, and R. Zabih, Visual Correspondence Using Energy
Minimization and Mutual Information. 2003.

29. Kirby, M. and L. Sirovich, Application of the Karhunen-Loeve Procedure for the
Characterization of Human Faces. Pattern Analysis and Machine Intelligence, 1990.
12: p. 103-108.

30. Kirby, M. and L. Sirovich, Low dimensional procedure for the characterization of human faces. Opt. Soc, 1987. 2(A): p. 586-591.

31. Laganiere, R. and E. Vincent, Matching Feature Points in Stereo Pairs: A Comparative Study of Some Matching Strategies. 2001, School of Information Technology and Engineering, University of Ottawa.

32. Lee, M.W. and S. Ranganath, Pose-invariant face recognition using a 3D deformable
model. Department of Electrical and Computer Engineering, National University of
Singapore, Pattern Recognition, 2003. 36: p. 1835-1846.

33. Lorensen, W.E. and H.E. Cline, Marching Cubes: A High Resolution 3D Surface Construction Algorithm. Computer Graphics, 1987. 21: p. 163-169.

34. Lu, X., R.-L. Hsu, A.K. Jain, B. Kamgar-Parsi, and B. Kamgar-Parsi, Face
Recognition with 3D Model-Based Synthesis. 2002.

35. Maas, H.-G., Image sequence based automatic multi-camera system calibration
techniques. 1997, Delft University of Technology, The Netherlands.

36. Mattoccia, S., M. Marchionni, G. Neri, and D. Stefano, A Fast Area Based Stereo
Matching Algorithm. 2002.

37. McLauchlan, P.F., A Batch/Recursive Algorithm for 3D Scene Reconstruction, in School of Electrical Engineering. 2001, University of Surrey.

38. McLauchlan, P.F., The variable state dimension filter., in VSSP 4/99. 1999, University
of Surrey.

39. McLauchlan, P.F. and A. Jaenicke, Accurate mosaicing using structure from motion
methods, in VSSP 5/99. 1999, University of Surrey.

40. McLauchlan, P.F. and D. Murray. A unifying framework for structure and motion
recovery from image sequences. in 5th International Conference on Computer Vision.
1995. Boston.

41. Memon, Q. and S. Khan, Camera calibration and three-dimensional world reconstruction of stereo-vision using neural networks. International Journal of Systems Science, 2001. 32(9): p. 1155-1159.

42. Moyung, T., Incremental 3D Reconstruction Using Stereo Image Sequences. 2000,
University of Waterloo: Ontario, Canada.

43. Pentland, A., B. Moghaddam, T. Starner, O. Oliyide, and M. Turk, View-based and
Modular Eigenspaces for Face Recognition, in Technical Report No 245. 1994: MIT
Media Laboratory, Perceptual Computing Section.

44. Rosenfeld, A. and M. Thurston, Coarse-fine Template Matching. IEEE Trans. Systems, Man and Cybernetics, 1977. 7: p. 104-107.

45. Sanderson, C. and S. Bengio, Robust Features For Frontal Face Authentication in
Difficult Image Conditions. 2003.

46. Sebe, N. and M.S. Lew, Comparing Salient Point Detectors. ICME, 2001.

47. Sharghi, S.D. and F.A. Kamangar, Geometric Feature-Based Matching in Stereo
Images. 1999.

48. Shi, F., N.R. Hughes, and G. Roberts, SSD Matching Using Shift-Invariant Wavelet
Transform: Mechatronics Research Centre, University of Wales College, Newport,
Allt-Yr-Yn Campus.

49. Smith, P., T. Drummond, and R. Cipolla. Segmentation of Multiple Motions by Edge
Tracking between Two Frames. in British Machine Vision Conference. 2000.

50. Theisel, H., Exact Isosurfaces for Marching Cubes. Computer Graphics Forum, 2002.
21(1): p. 19-31.

51. Treece, G.M., R.W. Prager, and A.H. Gee, Regularised marching tetrahedra:
Improved iso-surface extraction. 1998.

52. Trucco, E. and A. Verri, Introductory Techniques for 3-D Computer Vision. 1998:
Prentice Hall.

53. Tsai, R.Y., A versatile camera calibration technique for high-accuracy 3D machine
vision metrology using off-the-shelf TV camera and lenses. IEEE Trans. Robot
Automation, 1987. RA-3: p. 323-344.

54. Viola, P. and M. Jones, Rapid Object Detection using a Boosted Cascade of Simple
Features. Accepted Conference on Computer Vision and Pattern Recognition, 2001.

55. Wild, D., Realtime 3D Reconstruction From Stereo. 2003, University of York.

56. Xu, L.-Q., B. Lei, and E. Hendriks, Computer vision for a 3-D visualisation and
telepresence collaborative working environment. BT Technology Journal, 2002. 20(1):
p. 64-74.

57. Yambor, W., B. Draper, and R. Beveridge, Analyzing PCA-based Face Recognition
Algorithms: Eigenvector Selection and Distance Measures. 2000.

58. Yan, J. and H. Zhang, Synthesized Virtual View-Based EigenSpace for Face
Recognition. 1997.

59. Zou, J., P.-J. Ku, and L. Chen, 3D Face Reconstruction Using Passive Stereo. ECSE
6650 - Computer Vision, 2001.
