Masaryk University
Faculty of Informatics

Master's Thesis

Filip Roth
Acknowledgement
I would like to thank IBSmm Engineering for allowing parts of my work to be
published as this thesis, and for being understanding and supportive during
my studies.
I would like to thank my advisor, prof. Ing. Václav Přenosil, CSc., for his
helpfulness and guidance during the writing of this thesis.
Last but not least, I would like to thank my family and friends for their sup-
port during my studies.
Abstract
This thesis describes the use of current-day, low-cost Field Programmable Gate
Arrays (FPGAs) for real-time broadcast video processing. The capabilities of the
selected device family (Altera Cyclone IV) are discussed with regard to video
processing. Example IP cores (a deinterlacer, an alpha blender and a frame rate
converter) are designed in Verilog HDL and the design flow is described. The IP
cores are implemented in a real hardware system. The overall hardware system is
described, together with the individual FPGA components providing video
input/output and other I/O functions.
Keywords
FPGA, video processing, deinterlacing, alpha blending, frame rate conversion,
Verilog, HDL, hardware design flow
Contents
1 Preface
2 Introduction
  2.1 History of Field Programmable Gate Arrays
  2.2 Present day FPGAs
    2.2.1 Programmable logic
    2.2.2 Routing resources
    2.2.3 Embedded memory
    2.2.4 Embedded multipliers
    2.2.5 Development software
  2.3 Future possibilities of FPGAs
  2.4 Video processing on an FPGA
3 Broadcast video transport standards
  3.1 Parallel digital data
  3.2 Serial digital interface (SDI)
  3.3 Digital Video Interface (DVI)
4 Project requirements
  4.1 Video deinterlacing
  4.2 Low latency
  4.3 On-screen display generation
  4.4 Video stream switching
  4.5 Image capture
5 Device family selection
  5.1 Design requirements
  5.2 Altera Cyclone family
  5.3 Xilinx Spartan family
  5.4 Lattice Semiconductor Corporation
  5.5 Final selection
6 Evaluation of commercial IP cores from Altera
  6.1 Video and Image Processing Suite (VIP)
  6.2 DDR2 High Performance Controller II
  6.3 NIOS II soft processor
7 Selected system structure
  7.1 Block diagram
  7.2 Camera video input
  7.3 Frame buffer
  7.4 USB link
  7.5 Deinterlacer
  7.6 PC video input
  7.7 Alpha blender
8 Example video processing cores
  8.1 Deinterlacer
    8.1.1 Algorithm overview
    8.1.2 Algorithm selection
    8.1.3 Principle of operation
    8.1.4 Implementation
  8.2 Alpha blender
    8.2.1 Principle of operation
    8.2.2 Implementation
  8.3 Frame buffer
    8.3.1 Principle of operation
    8.3.2 Implementation
9 FPGA design flow
  9.1 Separate projects for custom components
  9.2 Use standard interfaces
  9.3 Optimize the memory access pattern
  9.4 SignalTap II logic analyzer
  9.5 Horizontal and vertical device migration
  9.6 Physical I/O pin mapping
10 Resulting hardware
  10.1 Verification of the hardware
11 Conclusion
Bibliography
A Pin placement
B Device floorplan
C Slack histogram
D Contents of the enclosed CD
Chapter 1
Preface
This work originates in the author's work as a hardware developer at IBSmm
Engineering [1], a hardware design house located in Brno, Czech Republic. The
video processing IP (intellectual property) cores presented in this work are part
of a video processing device developed at IBSmm. The subject of this thesis is
therefore a commercial product into which significant time and effort have been
invested.
As is common in the industry, the IBSmm Engineering management is not willing
to release the entire product documentation (including board design files,
firmware sources, intellectual property sources for the programmable logic, or
schematic documentation) to the public domain. Therefore, a decision was made
to publish only selected parts of the design, demonstrating the approaches and
algorithms used to accomplish the required functions, but not the entire source
code or project files.
For this reason, this work describes the overall system only briefly; a full
description is given only for the IP cores developed by the author to provide
the video processing functions of the system. The IP core source code is
available on the enclosed CD and in the online archive at Masaryk University,
each core as a separate Quartus II 9.1SP2 project.
The entire FPGA design is the original work of the author of this thesis,
together with the FPGA pin assignments, timing constraints and the major part
of the resulting hardware board schematic (some blocks in the board schematic
were reused from earlier projects and were not done by the author).
The complete project documentation is available upon request, provided that
the requestor signs an NDA with IBSmm Engineering.
The author hopes that, despite these limitations, this work will provide useful
information to readers interested in video processing on an FPGA, as well as a
"real world" demonstration of the development of a product using these
technologies.
Chapter 2
Introduction
Nowadays, as the processing power requirements of embedded systems grow, many
systems are starting to use FPGAs to offload processing functions traditionally
performed by an embedded CPU or ASIC.
This was made possible by advancements in chip manufacturing technology, as
described by Moore's law [2]: programmable logic device parameters such as
density, processing power, power consumption and cost improved until these
devices became viable alternatives to the traditional approaches.
Additionally, a design using programmable logic offers specific advantages over
other approaches, mainly the possibility to alter the configuration of the
hardware in the field (hence the name). This is very useful for bug fixes and
for the frequent need to modify a design after the product is finished.
Of course, this flexibility comes at a premium compared to a dedicated
'hardened' CPU or ASIC, usually in terms of both power consumption and unit
price. However, especially for small production series, the flexibility of
programmable logic may more than balance the additional cost of the device: a
CPU may not be exactly suited to the application, and ASIC development costs may
be well out of bounds of the estimated product volumes.
With the gradual transition of video signal representation from analog
interfaces such as VGA and SCART to the digital domain, programmable logic
started to provide the processing functions where required. Thanks to their
inherently parallel nature, these devices are well suited to algorithms
requiring high bandwidth and many parallel operations on the video data.
other vendors emerged and the devices started to be used in more market
segments than the initial networking and telecommunications areas. For a more
in-depth overview of FPGA history, please see [4].
· Processing the raw frame data can take advantage of the hardened DSP
blocks to ease the timing requirements for the logic fabric itself. Together
with pipelining the individual algorithm operations, this allows the design
of complex video processing paths even with HD resolutions.
· Due to the FPGA flexibility, the video processing path can be tailored to
specific project requirements.
· The flexibility of the FPGA architecture may prove useful for small produc-
tion series, where the development costs of an ASIC solution may be prohibitive.
For these reasons, the processing functions required for the project described
in this work were implemented on an FPGA.
Chapter 3
Broadcast video transport standards
The video data represent the scene in some predefined color space. The most
commonly used color spaces are RGB and YCbCr. With the RGB color space, each
pixel has red, green and blue components to identify its color. The RGB
standard is widely used in the PC industry for video data representation and as
a graphics card output format. With the YCbCr color space, the pixel has
luminance (brightness) and chrominance (color) coordinates to identify the
color. Conversion between these color spaces ranges from straightforward to
fairly complex, depending on the requested conversion quality.
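As an illustration (not taken from the thesis; the coefficients shown are the
standard BT.601 ones, and other standards such as BT.709 use different values),
a straightforward conversion from gamma-corrected R'G'B' components normalized
to [0, 1] to Y'CbCr is:

```latex
Y'  = 0.299\,R' + 0.587\,G' + 0.114\,B' \\
C_B = 0.564\,(B' - Y') \\
C_R = 0.713\,(R' - Y')
```

The inverse conversion, and fixed-point implementations of either direction,
introduce rounding concerns, which is why conversion quality can vary.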
The horizontal and vertical resolution of the frame, the frame rate, the color
space and the progressive/interlaced identifier together form a video format.
Video formats are standardized by organizations such as VESA [16] or SMPTE [17].
This chapter gives an overview of video transport standards used for video
input and output of the presented video processing system.
The representation of video data as a parallel clocked bus is most common when
connecting different integrated circuits on a printed circuit board. The bus
contains a master clock signal, horizontal and vertical synchronization
signals, an active picture indicator (data valid signal), a field identifier
for interlaced formats and the video data itself. This format with separate
horizontal and vertical synchronization is the most commonly used, probably for
its universality.
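As a sketch of what such a bus looks like at the FPGA boundary (illustrative
only; the signal names below are generic, not taken from the thesis sources), a
receiving module might declare:

```verilog
// Generic parallel digital video input with separate synchronization.
module video_sink (
    input         pclk,   // master pixel clock
    input         hsync,  // horizontal synchronization
    input         vsync,  // vertical synchronization
    input         de,     // active picture indicator (data valid)
    input         field,  // field identifier for interlaced formats
    input  [23:0] data    // pixel data, e.g. 8 bits per RGB component
);
    // ... video processing logic clocked by pclk ...
endmodule
```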
Although embedded synchronization can be used (the synchronization signals are
not separate wires but are embedded as special sequences directly in the video
data), this may cause design complications when the video processing ICs in use
each expect differing embedded synchronization sequences because of differing
standards (e.g. BT.656 vs. BT.1120).
The parallel transmission format requires that the individual bit wires have
their lengths closely matched to each other, to ensure that the pixel wavefront
is properly aligned at the receiver side. With today's high resolutions and
therefore high pixel clock rates, this data format may also cause problems with
signal crosstalk or reflections from impedance discontinuities; it is therefore
good practice to use some kind of termination at both the transmitter and
receiver sides.
SDI uses an NRZI encoding scheme to encode the data and a linear feedback shift
register to scramble it, controlling bit disparity. The video stream can also
include CRC (Cyclic Redundancy Check) checksums to verify that the transmission
occurred without errors.
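As a hedged illustration of the principle (a bit-serial sketch, not the thesis
code), SDI scrambling is specified by the generator polynomial x^9 + x^4 + 1,
followed by an NRZI conversion described by x + 1:

```verilog
// Bit-serial SDI scrambler sketch.
// Stage 1: self-synchronizing scrambler, G1(x) = x^9 + x^4 + 1.
// Stage 2: NRZ-to-NRZI conversion, G2(x) = x + 1 (toggle on '1').
module sdi_scrambler (
    input      clk,
    input      d_in,      // serial NRZ data in
    output reg nrzi_out   // scrambled NRZI data out
);
    reg [8:0] sr;                            // scrambler shift register
    wire scrambled = d_in ^ sr[8] ^ sr[3];   // feedback taps for x^9 and x^4
    always @(posedge clk) begin
        sr       <= {sr[7:0], scrambled};
        nrzi_out <= nrzi_out ^ scrambled;    // NRZI: invert on every '1'
    end
endmodule
```

Because the scrambler is self-synchronizing, the receiving descrambler recovers
the data without needing to know the transmitter's shift register state.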
Chapter 4
Project requirements
This chapter describes the various requirements for the processing hardware.
The device using the FPGA video processor is to be used in a medical
environment for displaying live video from endoscopic cameras during surgeries.
The system also has to be able to record the video and store the feed either
locally or over a network, but these functions are handled by a standard x86
system embedded in the device and as such are not the topic of this work.
Chapter 5
Device family selection
The maximum frequency required for any part of the design was estimated to be
150 MHz to 180 MHz for the most demanding components, namely the DDR2 memory
interface and the deinterlacer module.
The selection of the FPGA device family was based on these requirements,
together with a preference for wide availability and good online support.
Chapter 6
Evaluation of commercial IP cores from Altera
in the 9.0 version IP library), we had to abandon using the provided deinterlacer
component from Altera and had to develop our own solution.
At the time of writing this thesis, the compilation runs without any problems
when testing the deinterlacer core. This incident illustrates that an FPGA
development toolchain is rather complex software and should be thoroughly
evaluated before being relied upon in a design.
Regarding the memory access pattern, we needed to read and write sequential
areas of memory and therefore did not need the short memory-side burst lengths
of DDR1 memory, which could be more appropriate for other algorithms such as
real-time image rotation.
We tested the memory controller core by running the "memtest" example included
in the Nios II Embedded Design Suite. The tests passed with no problems and we
therefore decided to use this core.
Chapter 7
Selected system structure
frame buffer component saves memory bandwidth, since it allows buffering of the
half fields only; the final full frame is calculated by the deinterlacer after
the synchronization phase. This also means that the images transferred to the
host x86 system are half fields (for the 1080i interlaced input video format)
and have to be stretched to the original aspect ratio. This solution was found
to be perfectly acceptable, since there is no visible reduction in the quality
of the captured image.
[Figure: simplified video path block diagram — the camera video input feeds the
deinterlacer, whose output is combined in the alpha blender and sent to the
video output.]
The USB interface IC has two channels: one is configured for the RS232 standard
and is used for FPGA system control, the other is a one-way communication link
to the PC for captured still image transfers. The control channel is connected
to a UART component of a controlling SOPC system with the Nios II soft core
processor.
7.5 Deinterlacer
The deinterlacer component is fed video data by the frame buffer component; the
data is deinterlaced (if requested) into a full frame and output to the alpha
blender component. The deinterlacer core is described in a separate chapter.
Chapter 8
Example video processing cores
8.1 Deinterlacer
Deinterlacing is used to convert an interlaced video format to a progressive
one. In an interlaced video stream, each complete frame is transferred as two
half fields, odd and even. The odd field contains the odd picture lines and the
even field contains the even picture lines. By splitting the complete frame
into two half fields, the temporal resolution of the video feed is doubled and
motion appears smoother.
A progressive video format transfers frames as complete units, each frame
containing all (odd and even) lines of the picture. Progressive video does not
have the same temporal resolution as interlaced video of the same bandwidth; on
the other hand, it offers better vertical resolution and therefore a more
detailed image.
The interlaced format is commonly used in broadcast applications and the TV
industry, whereas the progressive format is more common in the PC industry.
Line duplication
The line duplication algorithm simply takes the input line and produces two
lines on the output, each identical to the image line on the deinterlacer
input. This is the simplest deinterlacing algorithm, but also the one with the
lowest output image quality. Since the half-field lines are duplicated, the
output progressive image appears pixelated in the vertical direction. This is
especially visible on sharp, highly contrasting edges in the image.
Line interpolation
The line interpolation algorithm does not replicate the missing lines; instead,
it calculates each missing line from the lines above and below it. This
produces a complete frame from a single half field, with better output image
quality than the line duplication algorithm. The most visible improvement is
that sharp, contrasting edges appear smoother thanks to the interpolated lines.
Weave algorithm
The weave algorithm uses two half fields to produce a progressive output frame.
The method works by merging the odd and even fields directly into one frame.
Compared to the bob algorithms, this method needs storage memory to temporarily
store the half-field data. It also introduces a half-field delay to the
processing chain, since the deinterlacer must wait for a complete field to
produce an output progressive frame. The output quality of this algorithm is
compromised by artifacts on edges in the resulting progressive image: since the
fields used to produce the output originate at different points in time, when
the video feed contains scenes with fast movement the edges appear distorted,
because each field captures the moving object in a different position.
8.1.4 Implementation
if (can_advance == 1)
begin
    x = x + 1;
    if (x == x_size)
    begin
        x = 0;
        y = y + 1;
        if (field == 0)
        begin
            if (y == 0) master_state = 2;
            if ((y >= 1) & (y < (y_size_minus_one)))
                master_state[1:0] = {1'b0, y[0]};
            if (y == (y_size_minus_one)) master_state = 0;
        end
        else
        begin
            if (y == 0) master_state = 2;
            if (y == 1) master_state = 0;
            if ((y >= 2) & (y < y_size))
                master_state[1:0] = {1'b0, ~y[0]};
        end
    end
end
The variables x and y contain the actual position within the video image data,
field is the even/odd field indicator, master_state[1:0] is a variable
indicating which of the actions A, B or C the deinterlacer should perform on
the current line, and can_advance is a signal indicating that the remaining
core components are ready for the next data item.
The deinterlacer_ram_buffer module is an Altera-specific instantiation of an
embedded memory block, forming a RAM to store the image line. The address of
this embedded RAM block is controlled by the scheduler; the
deinterlacer_mem_addr_delay module delays the address signals for line
operation B. Operation B means that the deinterlacer must store the incoming
line to the RAM buffer and at the same time load data from the very same
buffer. It is therefore necessary that the data can be read out of the buffer
before the new image line data are saved into it.
The deinterlacer_line_switch module switches between operations A, B and C as
requested by the scheduler module. Operation A (master_state = 2) means that
data received from the frame buffer component is stored to the RAM buffer and
at the same time routed through the deinterlacer_line_switch to the output
FIFO. Operation B (master_state = 1) means that the incoming data is stored to
the RAM buffer while the previous line data stored in the RAM buffer is read
out and sent to the deinterlacer_line_switch, where the pixel data is averaged
(interpolated) with the current line data and sent to the output FIFO.
Operation C (master_state = 0) does not read the incoming pixel data, but
instead simply outputs the stored line from the RAM buffer to the output FIFO.
The remaining components of the deinterlacer core are mainly support functions
that properly align the individual data and control signals to compensate for
the latency of the respective communicating components.
To relax the requirements on the maximum frequency of the device logic fabric,
the deinterlacer core processes two pixels at a time. This doubles the data bus
width, but allows the operating frequency to be halved while maintaining the
required bus bandwidth.
The deinterlacer core expects the field data in a standard RGB color space,
with every color component having an 8-bit value range (0..255).
The interpolation (vertical averaging) of the neighboring half-field image
lines is done by adding the individual red, green and blue components of the
pixel colors (the two pixels in the RAM buffer from the previous image line and
the two pixels currently being received and stored to the RAM buffer) and then
performing a one-bit right shift, thereby calculating the arithmetic average of
the two values.
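Per color component, the averaging step described above amounts to the
following (a minimal sketch assuming 8-bit components; the signal names are
illustrative, not taken from the thesis code):

```verilog
// Vertical interpolation of one 8-bit color component: add the pixel from
// the buffered previous line and the incoming pixel, then shift right by one.
wire [7:0] pix_buffered, pix_incoming;                        // same column, adjacent lines
wire [8:0] sum = {1'b0, pix_buffered} + {1'b0, pix_incoming}; // 9-bit sum, no overflow
wire [7:0] interpolated = sum[8:1];                           // arithmetic average
```

Widening the sum to 9 bits before the shift avoids losing the carry when both
components are large.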
The alpha value for each pixel can either be fixed for the entire image or
delivered to the blender core as a separate value for each individual pixel,
for example in the unused 8 bits of a 32-bit pixel memory window holding 24-bit
pixel colors. In this work, the blender core has a fixed alpha value for the
entire active picture window. Although a per-pixel alpha channel was initially
considered, to provide a simple way of generating the OSD menu, the fixed-alpha
solution was preferred. The main reason for this decision was that the PC video
feed is used as the source for the OSD menu, and it would be problematic to
transmit the alpha channel through the standard 24-bit DVI interface. Using a
fixed alpha value, the entire range of the pixel values on the DVI interface
can be used for pixel color space coordinates, and OSD generation is achieved
by simply displaying an image on the x86 host system graphics output.
This solution also has its drawbacks, most notably the inability to display a
non-transparent OSD image on top of the live camera video feed. This was
resolved by dedicating a single pixel color from the x86 host system as the
transparent color value. When this color is encountered by the blender core,
the value of the camera video pixel is assigned to the output, regardless of
the alpha value setting. This allows the generation of either a non-transparent
or a semitransparent OSD image on top of the live video feed.
This means that the output video feed has the same parameters (pixel clock,
timing, resolution) as the video feed from the x86 host system. The live video
signal from the camera input is mixed into this video feed using the preceding
frame buffer and deinterlacer components. This allows the system to mix the two
streams with no interruptions in output video timing, since the camera feed is
passed through the frame buffer component and can therefore be matched to the
reference video signal.
The calculation of the output pixel value is divided into separate calculations
for each color component. Each calculation of an output color component is
further divided into pipelined stages, relaxing the timing requirements of the
design compared to an unpipelined implementation.
For the calculation of the output values, the blender core uses equations (8.1)
translated into the integer domain.
8.2.2 Implementation
The output video signal is formed by the output pixel_out[23..0] together with
the timing control signals de_out, hsync_out and vsync_out. The output video
feed uses the same clock as the input reference video feed, i.e. clock_in.
Following is a code walk-through for a single color component (red). The core
starts by registering the input information to reduce the length of the input
path and therefore improve the maximum operating frequency of the core.
always @(posedge clock_in)
begin
    pixel_a <= pixel_a_in;
    pixel_b <= pixel_b_in;
end
Now the core has the input pixel information available in internal registers.
The alpha value for the current video frame is registered during the vertical
blanking interval of the reference video signal; this way, the alpha value is
forced to be the same within each individual video frame. To further improve
the maximum frequency of the core, the alpha values for both video inputs are
immediately calculated (the layerAalpha and (1 - layerAalpha) expressions
described in equations (8.1)).
always @(posedge clock_in)
begin
    if (vsync_in == 1)
    begin
        alpha_a[7:0] = alpha[7:0];
        alpha_b[7:0] = 8'd255 - alpha[7:0];
    end
end
The blender core then continues by calculating the intermediate values for the
expressions listed in (8.1). The core produces intermediate values for the
pixel_a and pixel_b color components. Since there were some problems with the
integer representation of the equations (uneven mapping of the multiplication
results: the component output value for layerXalpha = 255 was 254), the core
checks the alpha value and, if it is maximal, simply outputs the respective
color component. Otherwise, the core performs an integer multiplication of the
pixel color component and the alpha value.
always @(posedge clock_in)
begin
    if (alpha_a == 255)
    begin
        red_a[15:0] = {pixel_a[7:0], 8'b00000000};
        red_b = 0;
    end
    else red_a = pixel_a[7:0] * alpha_a;
    if (alpha_b == 255)
    begin
        red_b[15:0] = {pixel_b[7:0], 8'b00000000};
        red_a = 0;
    end
    else red_b = pixel_b[7:0] * alpha_b;
end
The multiplication results are stored in the pipeline registers red_a_pipe and
red_b_pipe. The core then continues by producing the final pixel output value.
In this step the core also checks for the transparent color as described above
and decides whether to output the pixel value based on the previous
calculations or to output the camera video feed pixel value directly. The color
selected as the transparent color for the video overlay is 0xFF00FF (magenta),
considered very unlikely to appear in the x86 host system video output under
normal conditions.
always @(posedge clock_in)
begin
    if (pc_1 == 24'hFF00FF)
    begin
        red[15:8]   = cam_1[7:0];
        green[15:8] = cam_1[15:8];
        blue[15:8]  = cam_1[23:16];
    end
    else
    begin
        red   = red_a_pipe + red_b_pipe;
        green = green_a_pipe + green_b_pipe;
        blue  = blue_a_pipe + blue_b_pipe;
    end
end
The cam_1 register stores the camera input pixel value for the pixel currently
being processed; the pc_1 register stores the original reference video feed
pixel value. Separate registers are necessary, since at this stage of the
processing pipeline the original pixel_a and pixel_b registers already contain
newer pixels due to the processing latency. This is the last step of the
processing pipeline.
To compensate for the latency introduced by the individual processing stages,
it is necessary to also properly align the output video timing signals to match the
active picture data. This is done by a simple delay stage inside the blender core.
always @(posedge clock_in)
begin
    delay_3 = delay_2;
    delay_2 = delay_1;
    delay_1 = {de_in, hsync_in, vsync_in};
end
The individual bits of the delay_3 register are then output as the final timing
control signals.
At first, the design of the core did not use any pipelining, and the core had
very low performance in terms of the maximum allowable frequency of the
incoming video signal. After introducing the pipelining stages, the core is
capable of handling input video signals with pixel clocks above 150 MHz and
therefore supports the required HD resolutions (the pixel clock of the 1080p
video format is 148.5 MHz).
Double buffering uses two buffers, one for the data producer and one for the
data consumer. Use of the double buffering method is limited to cases where
data production can be controlled so that it works synchronously with data
consumption. This is, for example, the case with graphics cards in the PC
industry. To remove video tearing during the display of rendered scenes, the
GPU renders the scene into a different buffer than the one used to send frame
data to the monitor. The buffers are flipped in the vertical blanking period of
the monitor output timing signal. This removes video tearing, but at the same
time introduces inefficiency into the rendering process, since the GPU has to
wait for the start of the vertical blanking interval before rendering the next
frame (otherwise the GPU has no free buffer to render the scene into).
For situations in which data production cannot be synchronized with data
consumption, double buffering is not optimal, since the method has to drop a
large number of data units. When processing video, this behavior is clearly
visible in scenes with moving objects, where the motion appears choppy and
unnatural.
Triple buffering removes the problems of double buffering by introducing a
third buffer. This ensures that one buffer is always available for either the data
producer or the data consumer.
When considering using triple buffering for the synchronization of two video
streams, each having a different frame rate, this method can provide a simple
frame rate conversion. For example, if the synchronized stream has a lower frame
rate than the stream being synchronized to, the data consumer can reuse the currently processed frame by duplicating it, without any impact on the data producer and the frame storage. The same holds in the opposite scenario, where the data producer stores frames at a higher rate than the data consumer processes them. In this case, frames can be dropped from the output stream, again without any interruption of the data consumption process.

Figure: Triple buffering. The data producer has finished storing a frame into Buffer A and requests a free buffer; writing continues into Buffer C while Buffer B is still being read, and Buffer A becomes idle.
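As an illustration of the rotation logic (a minimal software sketch, not part of the FPGA design; all names here are hypothetical), the triple buffer can be modeled in a few lines of Python:

```python
# Minimal triple-buffering arbiter: the producer and consumer each hold
# one buffer, and the third is always free, so neither side ever waits.
# With unequal rates this naturally duplicates frames (consumer faster)
# or drops them (producer faster).
class TripleBuffer:
    def __init__(self):
        self.writing, self.reading, self.idle = "A", "B", "C"
        self.fresh = False  # idle buffer holds a frame newer than 'reading'

    def producer_done(self):
        # Finished frame stays in the old writing buffer, which becomes
        # idle; the producer takes over the previously idle buffer.
        self.writing, self.idle = self.idle, self.writing
        self.fresh = True

    def consumer_next(self):
        if self.fresh:  # a newer frame is ready: swap it in
            self.reading, self.idle = self.idle, self.reading
            self.fresh = False
        return self.reading  # otherwise re-read (duplicate) current frame

tb = TripleBuffer()
tb.producer_done()          # frame finished in A, producer moves to C
first = tb.consumer_next()  # consumer picks up A
again = tb.consumer_next()  # no new frame yet: A is duplicated
```

With a slower consumer the model duplicates the current frame (the same buffer is re-read); with a faster producer the oldest unconsumed frame is silently overwritten, i.e. dropped — exactly the behavior the simple frame rate conversion relies on.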
The frame buffer was implemented as an SOPC Builder generated component. The module uses several subcomponents connected together by the Avalon Interface developed by Altera [25]. The main parts of the frame buffer module are the frame writer component, the frame reader component, the USB reader component and the necessary DDR2 memory controller core. By using the Avalon interface, the system could take advantage of the Altera provided High Performance DDR2 Memory Controller core.
The frame writer was based on an example template of an Avalon Interface Memory Mapped Master with burst writes [26]. This is a template for a master component located in an Avalon interface SOPC system. The component can be thought of as a DMA (Direct Memory Access) engine with the source data not residing in the target address space. It provides a bridging function between the Avalon subsystem and the user logic design. The original component includes a single clock FIFO to provide the input from the user design into the Avalon subsystem. For the purposes of this project, this single clock FIFO was replaced with a dual clock version of the same, thereby providing the clock crossing function required to move the video data from the camera video input clock domain to the internal Avalon subsystem clock domain. There were also some other modifications to the template, mainly regarding the start of the transfer. The bursting capability enables the DMA engine to transfer large portions of the video data, which greatly improves the efficiency of DDR2 memory access. The bursting capability is discussed further in the implementation section.
The frame reader is also based on an Altera template, namely the Avalon Interface Memory Mapped Master with burst reads. This component works in the same way as the frame writer, but it instead reads data from the DDR2 memory address space and transfers it through a dual clock FIFO into the user design. The user design is in this case the deinterlacer module.
The USB reader component is based on the same component as the frame
reader, but this time the output is connected to the external USB to FIFO bridge
(FT2232H [21]) from FTDI. The component was modified to adapt to the timing
required by the external USB bridge chip.
The last part of the frame buffer component is the Altera DDR2 High Perfor-
mance Memory Controller core. This component provides the means to access the
external DDR2 memory chip by mapping it into the Avalon subsystem memory
space.
This entire component is controlled by a second SOPC builder subsystem lo-
cated inside the FPGA. This subsystem provides the control functions for the en-
tire FPGA design and the peripheral devices, including the communication link
of the device to the PC using a second data channel of the FT2232H bridge, con-
trol of the timing of frame reading and writing of the frame buffer components,
input video format detection and other support functions of the FPGA system.
This subsystem is not described here due to the reasons described in the Preface
to this work.
8.3.2 Implementation

As was said above, the component was implemented as an SOPC Builder subsystem. The system is clocked by an 81 MHz clock signal generated by a PLL component from an input 27 MHz clock signal. To reduce the maximum operating frequency requirements, the design uses a 64-bit width for the interconnecting Avalon bus fabric.
The local side of the DDR2 memory controller core connects to the Avalon bus fabric; the memory side connects to the external DDR2 memory chip. The DDR2 memory controller core is configured to run the external 16-bit DDR2 memory chip at 162 MHz, twice the frequency of the local side bus of the frame buffer subsystem. Since both clocks are generated by a single PLL from the same reference clock source, there is no need for a clock crossing bridge when moving data to and from the DDR2 memory space. The memory controller is configured to use a burst length of 8 beats on the memory side. This translates to the 128 bits of a local side data transaction being transferred in two clock periods of the local side bus, since (in bytes):

8 beats × 2 B/beat = 16 B = 2 × 8 B (two 64-bit local words)
This provides a highly efficient data path to transfer the video frame data. The
DDR2 memory controller core is also configured to allow for a local burst size of
64 words. This means that the core can be exclusively accessed by one of the three
masters on the bus (frame reader, frame writer, USB link) for a maximum of 64
data word transfers, further improving the memory access efficiency.
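The burst arithmetic above can be checked numerically; the following sketch uses the figures from this configuration:

```python
# Memory side: burst of 8 beats on a 16-bit DDR2 device.
beats, mem_width_bits = 8, 16
burst_bytes = beats * mem_width_bits // 8            # 8 * 2 B = 16 B

# Local (Avalon) side: 64-bit words.
local_width_bits = 64
local_words = burst_bytes * 8 // local_width_bits    # 16 B / 8 B = 2 words

# DDR moves 2 beats per memory clock, and the local clock runs at half
# the memory clock, so one local clock period covers 4 memory beats.
beats_per_local_clock = 2 * 2
local_clocks = beats // beats_per_local_clock        # 2 local clock periods
```

One memory-side burst thus maps exactly onto two 64-bit local words delivered in two local clock periods, which is why the data path incurs no idle cycles within a burst.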
When testing the frame buffer subsystem on a custom development board, the DDR2 memory channel was 32 physical bits wide (there were two physical x16 DDR2 chips, each storing one half of the 32-bit word) and the frame buffer subsystem did not use any burst transfers. With this wide channel configuration, the system worked fine. However, after trying to optimize the design to use only a single x16 DDR2 memory device, it was found that the single x16 DDR2 device is not capable of providing the required bandwidth with this setup. This was clearly visible in the quality of the resulting output video feed, ranging from mild occasional tearing to a complete breakdown of the output video structure a few seconds after system startup. After some investigation with the SignalTap II logic analyzer, it was found that the DDR2 memory was being accessed in a highly inefficient manner.
The memory address space is shared by three Avalon master components: the frame reader, the frame writer and the USB link. Since master access to the slave memory address space is scheduled by a round robin arbiter, this resulted in a high number of accesses for very small data items. The memory controller had to continuously switch either the currently accessed row and column or switch between read and write operations. Both these events cause a high performance penalty by introducing wait cycles between the individual operations.
For this reason, the Avalon example templates then in use were changed to their bursting capable versions, and the memory controller core was changed to its burst capable version. This resulted in a significant performance improvement of the system, since the memory core was now accessed with very large transactions, diminishing the performance penalties of the "small and frequent" access pattern.
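The effect of burst length on efficiency can be approximated with a simple first-order model (illustrative only; the overhead figure below is a hypothetical placeholder, not a measured value for this controller):

```python
# Each memory access pays a fixed overhead of wait cycles (row/column
# switching, read/write turnaround) before its data beats transfer.
# Longer bursts amortize that overhead over more useful cycles.
def efficiency(burst_beats, overhead_cycles=10):
    return burst_beats / (burst_beats + overhead_cycles)

single = efficiency(1)   # "small and frequent" pattern: ~9% useful cycles
bursty = efficiency(64)  # 64-word bursts: ~86% useful cycles
```

Even this crude model shows an order-of-magnitude gap between single-word and long-burst access, consistent with the observed bandwidth collapse on the single x16 device.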
The read master template used as a base for the custom project components was modified by adding a dual clock FIFO in place of the originally used single clock one. Because the Verilog module source instantiates the FIFO as a standalone entity and the interfaces of the dual and single clock versions are relatively similar, the modification was fairly straightforward.
the_master_to_video_fifo the_master_to_user_fifo (
    .aclr    (control_reset),
    .data    (master_read_data),
    .rdclk   (fifo_read_clock_input),
    .rdreq   (fifo_read_rdreq_input),
    .wrclk   (clk),
    .wrreq   (master_read_data_valid),
    .q       (fifo_read_data_output),
    .rdempty (fifo_read_rdempty_output),
    .wrusedw (fifo_used),
    .wrfull  ()
);
The other significant modification to the reader template was the inclusion
of the FT2232H FIFO interface for the USB link component. This was needed to
properly present the parallel data to the external chip.
The frame writer component works in the same way as the frame reader component, with the data direction reversed. The input video data from the camera is fed to the input dual clock FIFO using the data enable signal of the video timing information. By using this signal as the write enable signal for the FIFO, only the active picture data is stored in the FIFO component.
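The DE-gated FIFO write can likewise be modeled in software (a sketch; signal and function names here are illustrative, not from the thesis code):

```python
from collections import deque

def store_active(samples):
    """Write only active-picture samples to the FIFO: the data enable
    (DE) signal doubles as the FIFO write enable, so blanking-interval
    samples are never enqueued."""
    fifo = deque()
    for de, pixel in samples:
        if de:
            fifo.append(pixel)
    return fifo

# an active region surrounded by blanking samples
stream = [(0, None), (1, 0x10), (1, 0x20), (0, None), (1, 0x30)]
fifo = store_active(stream)  # holds pixels 0x10, 0x20, 0x30 in order
```

The consequence is that the FIFO fill level directly tracks the amount of active picture data, which simplifies the start-of-burst decision on the Avalon side.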
Chapter 9
FPGA design flow
data word is discarded. This may hold true for other components of the Avalon
fabric as well.
The memory access pattern is an important design decision as well. As de-
scribed in the previous chapter, the differences in memory bandwidth between
bursting and non-bursting transfers can be huge. There is plenty of information
available on how to optimize the access patterns for a given task [27][28].
Chapter 10
Resulting hardware
After the evaluation of the design on a development board, a final hardware board was developed for the FPGA video processor. Based on the design requirements, a target device was selected to fit the design. The device selected from the Cyclone IV family was the EP4CE15F17C7N, with the option to migrate to the older Cyclone III generation device EP3C16F256C7N. These devices have very similar pinouts and can therefore be interchanged with no major problems.
Figure 10.1: Final hardware board containing the FPGA video processor
The entire system resource usage (after optimizations done by the synthesis
and fitter compilation phases) is 13565 logic elements (88% usage), 313072 mem-
ory bits (61% of the device resources) and 151 pins (91%). An illustration of the
final pin usage can be seen in the appendix section to this work. Of this resource
usage the deinterlacer block uses 501 LEs, alpha blender uses 228 and the frame
buffer component uses 8269 LEs. The high resource usage of the frame buffer is
mostly caused by the DDR2 memory controller core (6304 LEs) and Avalon bus
arbitration (776 LEs). The reader and writer components use about 400 logic ele-
ments each.
The DDR2 x16 memory is connected to banks 3 and 4, the other I/O pins are
mostly used to connect to the parallel video data buses of the two video input ICs
and one video output chip.
Chapter 11
Conclusion
This work presented an overview of the development process of a hardware device using a Field Programmable Gate Array and provided example IP components for video processing to illustrate the possibilities and approaches used in such development.
The capabilities of the selected low cost FPGA family from Altera were found to satisfy the processing power requirements with no major problems. The flexibility of the FPGA architecture proved very useful, since there were many minor modifications to the system behavior throughout the development of the project. Although the per-chip price of an FPGA is still higher than that of an ASIC, for small production series this difference is balanced by the flexibility and the ability to customize the design to project specific requirements.
The overall development of the system took about one year; as of the time of writing of this thesis, the system is being prepared for production. The latency of the displayed live video feed was found to be well within the required range, providing the final users with a realistic response of the system to the scene being displayed in the video feed.
Apart from the cores developed by the author, the design uses many ready to use cores provided by Altera. The experience with their use was mostly positive, with the exception of some of the Video and Image Processing Suite components, which failed even to compile successfully. This illustrates the need to properly evaluate any IP being considered for use before the actual development.
Of course, the design could be done better. In retrospect, the design could have been made more effective in terms of simplicity and overall readability. For example, the frame buffer component is controlled by an SOPC controller with a Nios II processor core. All the control signals and data buses are routed as connections in the top level schematic file from the controller module to the frame buffer module. Together with the frequent modifications to the design, this created a rather unreadable and messy mesh of interconnecting wires. Should the system be developed now, all of these control signals would be replaced by setting the core parameters using registers accessible via the Avalon interface. This would result in a single, clean and readable entity.
FPGA design is a rather attractive field; since many computing and embedded projects are slowly transitioning from single core to parallel architectures, programmable logic is a viable field of interest. With the announced production level quality partial reconfiguration [29], the devices may soon be able (in a usable way) to instantiate portions of a hardware design on the go, practically providing a hardware on demand capability. This may well be the case for future architectures of digital systems, where some fixed silicon sequential core loads hardware blocks as required by an operating system. There may be interesting research topics in this area considering the possible implications of the discovery of the memristor [7], which is a circuit element capable of acting both as a storage element and as logic. The implications of merging high density storage with programmable logic resources would be immense.
Should the memristor discovery indeed deliver the technology for the next generation of programmable logic, such devices may be capable of further reducing the performance gap between programmable and fixed silicon, and may start to appear in systems traditionally relying on an ASIC solution.
Bibliography
[1] IBSmm Engineering Czech Design Center official website [online]. IBSmm CDC, [cited 2011-04-17]. URL: http://www.ibsmm.com.
[2] Wikipedia: Moore's law [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Moore%27s_law.
[3] Wikipedia: Programmable logic devices [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Programmable_logic_devices.
[4] Wikipedia: Field-programmable gate array [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Field-programmable_gate_array.
[5] Altera Cyclone IV Device Handbook [online]. Altera Corp., [cited 2011-04-26]. URL: http://www.altera.com/literature/hb/cyclone-iv/cyiv-5v1.pdf.
[6] Altera Stratix V Device Family Overview [online]. Altera Corp., [cited 2011-04-17]. URL: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf.
[7] Finding the Missing Memristor, R. Stanley Williams [online]. Lecture recording, YouTube, [cited 2011-04-17]. URL: http://www.youtube.com/watch?v=bKGhvKyjgLY.
[8] Achronix Speedster22i product overview [online]. Achronix Semiconductor Corp., [cited 2011-04-17]. URL: http://www.achronix.com/products/speedster22i.html.
[9] Wikipedia: Memristor [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Memristor.
[10] End of the CPU? HP demos configurable memristor [online]. R Colin Johnson, EE Times, [cited 2011-04-17]. URL: http://www.eetimes.com/electronics-news/4088557/End-of-the-CPU-HP-demos-configurable-memristor.
[11] Intel Introduces Configurable Intel Atom-based Processor [online]. Press release, Intel Corporation, [cited 2011-04-18]. URL: http://newsroom.intel.com/docs/DOC-1512.
[12] Altera Video and Image Processing Suite User Guide [online]. Altera Corp., [cited 2011-04-26]. URL: http://www.altera.com/literature/ug/ug_vip.pdf.
[17] Society of Motion Picture and Television Engineers official website [online]. SMPTE, [cited 2011-04-26]. URL: http://www.smpte.org/home/.
[18] Wikipedia: Serial Digital Interface [online]. Wikipedia, [cited 2011-04-26]. URL: http://en.wikipedia.org/wiki/Serial_digital_interface.
[19] Altera Corporation official website [online]. Altera Corp., [cited 2011-04-07]. URL: http://www.altera.com.
[24] FPGA designer curriculum [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/education/training/curriculum/fpga/trn-fpga.html.
[27] The Efficiency of the DDR & DDR2 SDRAM Controller Compiler [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/literature/wp/wp_ddr_sdram_efficiency.pdf.
[28] Altera Forums [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.alteraforum.com/.
[29] Cyclone V Device Family Advance Information Brief [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/literature/hb/cyclone-v/cyv_51001.pdf.
Appendix A
Pin placement
The pin placement illustration for the target device EP4CE15F17C7N. This place-
ment is also compatible with the alternative EP3C16F256C7N.
Appendix B
Device floorplan
The resulting floorplan for the target device EP4CE15F17C7N showing a rela-
tively high resource usage of the chip by the design.
Appendix C
Slack histogram
This chart shows the TimeQuest timing analysis slack distribution histogram for
the final design.
Appendix D