Masaryk University
Faculty of Informatics

Master's Thesis

Filip Roth
Acknowledgement
I would like to thank IBSmm Engineering for allowing parts of my work to be
published as this thesis, and for being understanding and supportive during
my studies.
I would like to thank my advisor, prof. Ing. Václav Přenosil, CSc., for his
helpfulness and guidance during the writing of this thesis.
Last but not least, I would like to thank my family and friends for their sup-
port during my studies.
Abstract
This thesis describes the use of current-day, low-cost Field Programmable Gate
Arrays (FPGAs) for real-time broadcast video processing. The capabilities of the
selected device family (Altera Cyclone IV) are discussed with regard to video
processing. Example IP cores (a deinterlacer, an alpha blender and a frame rate
converter) are designed in Verilog HDL and the design flow is described. The IP
cores are implemented in a real hardware system. The overall hardware system is
described, together with the individual FPGA components providing video
input/output and other I/O functions.
Keywords
FPGA, video processing, deinterlacing, alpha blending, frame rate conversion,
Verilog, HDL, hardware design flow
Contents
1 Preface
2 Introduction
  2.1 History of Field Programmable Gate Arrays
  2.2 Present day FPGAs
    2.2.1 Programmable logic
    2.2.2 Routing resources
    2.2.3 Embedded memory
    2.2.4 Embedded multipliers
    2.2.5 Development software
  2.3 Future possibilities of FPGAs
  2.4 Video processing on an FPGA
3 Broadcast video transport standards
  3.1 Parallel digital data
  3.2 Serial digital interface (SDI)
  3.3 Digital Video Interface (DVI)
4 Project requirements
  4.1 Video deinterlacing
  4.2 Low latency
  4.3 On-screen display generation
  4.4 Video stream switching
  4.5 Image capture
5 Device family selection
  5.1 Design requirements
  5.2 Altera Cyclone family
  5.3 Xilinx Spartan family
  5.4 Lattice Semiconductor Corporation
  5.5 Final selection
6 Evaluation of commercial IP cores from Altera
  6.1 Video and Image Processing Suite (VIP)
  6.2 DDR2 High Performance Controller II
  6.3 NIOS II soft processor
7 Selected system structure
  7.1 Block diagram
  7.2 Camera video input
  7.3 Frame buffer
  7.4 USB link
  7.5 Deinterlacer
  7.6 PC video input
  7.7 Alpha blender
8 Example video processing cores
  8.1 Deinterlacer
    8.1.1 Algorithm overview
    8.1.2 Algorithm selection
    8.1.3 Principle of operation
    8.1.4 Implementation
  8.2 Alpha blender
    8.2.1 Principle of operation
    8.2.2 Implementation
  8.3 Frame buffer
    8.3.1 Principle of operation
    8.3.2 Implementation
9 FPGA design flow
  9.1 Separate projects for custom components
  9.2 Use standard interfaces
  9.3 Optimize the memory access pattern
  9.4 SignalTap II logic analyzer
  9.5 Horizontal and vertical device migration
  9.6 Physical I/O pin mapping
10 Resulting hardware
  10.1 Verification of the hardware
11 Conclusion
Bibliography
A Pin placement
B Device floorplan
C Slack histogram
D Contents of the enclosed CD
Chapter 1
Preface
This work originates in the author's work as a hardware developer at IBSmm
Engineering [1], a hardware design house located in Brno, Czech Republic. The
video processing IP (intellectual property) cores presented in this work are part
of a video processing device developed at IBSmm. The subject of this thesis is
therefore a commercial product into which significant time and effort have been
invested.
As is common in the industry, the IBSmm Engineering management is not willing
to release the entire product documentation (including board design files,
firmware sources, intellectual property sources for the programmable logic, or
schematic documentation) to the public domain. Therefore, a decision was made
to publish only selected parts of the design, demonstrating the approaches and
algorithms used to accomplish the required functions, but not the entire source
code or project files.
For this reason, this work describes the overall system only briefly; a full
description is given only for the IP cores developed by the author to provide
the video processing functions of the system. The IP core source code is
available on the enclosed CD and in the online archive at Masaryk University,
each core as a separate Quartus II 9.1SP2 project.
The entire FPGA design is the original work of the author of this thesis,
together with the FPGA pin assignments, timing constraints and the major part
of the resulting hardware board schematic (some blocks in the board schematic
were reused from earlier projects and were not done by the author).
The complete project documentation is available upon request, provided that
the requestor signs an NDA with IBSmm Engineering.
The author hopes that, despite these limitations, this work will provide useful
information to readers interested in video processing on an FPGA, as well as a
"real world" demonstration of the development of a product using these
technologies.
Chapter 2
Introduction
Nowadays, as the processing power requirements of embedded systems grow, many
systems are starting to use FPGAs to offload processing functions traditionally
performed by an embedded CPU or ASIC.
This was made possible by advancements in chip manufacturing technology, as
described by Moore's law [2]: programmable logic device parameters such as
density, processing power, power consumption and cost improved until these
devices became viable alternatives to the traditional approaches.
Additionally, a design using programmable logic offers specific advantages over
other approaches, mainly the possibility to alter the configuration of the
hardware in the field (hence the name). This is very useful for bug fixes and
for the frequent need to modify a design after the product is finished.
Of course, this flexibility comes at a premium compared to a dedicated
'hardened' CPU or ASIC, usually in terms of both power consumption and unit
price. However, especially for small production series, the flexibility of
programmable logic may more than balance the additional cost of the device: a
CPU may not be exactly suited to the application, and ASIC development costs may
be well out of bounds of the estimated product volumes.
With the gradual transition of video signal representation from analog
interfaces such as VGA and SCART to the digital domain, programmable logic
started to provide the processing functions where required. Thanks to their
inherently parallel nature, these devices are well suited to algorithms
requiring high bandwidth and many parallel operations on the video data.
other vendors emerged and the devices started to be used in more market
segments than the initial networking and telecommunications areas. For a more
in-depth overview of FPGA history, please see [4].
· Processing the raw frame data can take advantage of the hardened DSP
blocks to ease the timing requirements for the logic fabric itself. Together
with pipelining the individual algorithm operations, this allows the design
of complex video processing paths even with HD resolutions.
· Due to the FPGA flexibility, the video processing path can be tailored to
specific project requirements.
· The flexibility of the FPGA architecture may prove useful for small produc-
tion series, where the development costs of an ASIC solution may be prohibitive.
For these reasons, the processing functions required for the project described
in this work were implemented on an FPGA.
Chapter 3
Broadcast video transport standards
The video data represent the scene in some predefined color space. The most
commonly used color spaces are RGB and YCbCr. With the RGB color space, each
pixel has red, green and blue components to identify its color. The RGB
standard is widely used in the PC industry for video data representation and as
a graphics card output format. With the YCbCr color space, the pixel has
luminance (brightness) and chrominance (color) coordinates to identify the
color. Conversion between these color spaces ranges from straightforward to
fairly complex, depending on the requested conversion quality.
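As an illustration (not taken from the thesis; the coefficients shown are the
standard BT.601 ones, and other standards such as BT.709 use different values),
a straightforward conversion from gamma-corrected R'G'B' components normalized
to [0, 1] to Y'CbCr is:

```latex
Y'  = 0.299\,R' + 0.587\,G' + 0.114\,B' \\
C_B = 0.564\,(B' - Y') \\
C_R = 0.713\,(R' - Y')
```

The inverse conversion, and fixed-point implementations of either direction,
introduce rounding concerns, which is why conversion quality can vary.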
The horizontal and vertical resolution of the frame, the frame rate, the color
space and the progressive/interlaced identifier together form a video format.
Video formats are standardized by organizations such as VESA [16] or SMPTE [17].
This chapter gives an overview of video transport standards used for video
input and output of the presented video processing system.
The representation of video data as a parallel clocked bus is most common when
connecting different integrated circuits on a printed circuit board. The bus
contains a master clock signal, horizontal and vertical synchronization
signals, an active picture indicator (data valid signal), a field identifier
for interlaced formats and the video data itself. This format with separate
horizontal and vertical synchronization is the most commonly used, probably for
its universality.
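As a sketch of what such a bus looks like at the FPGA boundary (illustrative
only; the signal names below are generic, not taken from the thesis sources), a
receiving module might declare:

```verilog
// Generic parallel digital video input with separate synchronization.
module video_sink (
    input         pclk,   // master pixel clock
    input         hsync,  // horizontal synchronization
    input         vsync,  // vertical synchronization
    input         de,     // active picture indicator (data valid)
    input         field,  // field identifier for interlaced formats
    input  [23:0] data    // pixel data, e.g. 8 bits per RGB component
);
    // ... video processing logic clocked by pclk ...
endmodule
```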
Although embedded synchronization can be used (the synchronization signals are
not separate wires but are embedded as special sequences directly in the video
data), this may cause design complications when the video processing ICs in use
each expect differing embedded synchronization sequences because of differing
standards (e.g. BT.656 vs. BT.1120).
The parallel transmission format requires that the individual bit wires have
their lengths closely matched to each other, to ensure that the pixel wavefront
is properly aligned at the receiver side. With today's high resolutions and
therefore high pixel clock rates, this data format may also cause problems with
signal crosstalk or reflections from impedance discontinuities; it is therefore
good practice to use some kind of termination at both the transmitter and
receiver sides.
SDI uses an NRZI encoding scheme to encode the data and a linear feedback shift
register to scramble it, controlling bit disparity. The video stream can also
include CRC (Cyclic Redundancy Check) checksums to verify that the transmission
occurred without errors.
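As a hedged illustration of the principle (a bit-serial sketch, not the thesis
code), SDI scrambling is specified by the generator polynomial x^9 + x^4 + 1,
followed by an NRZI conversion described by x + 1:

```verilog
// Bit-serial SDI scrambler sketch.
// Stage 1: self-synchronizing scrambler, G1(x) = x^9 + x^4 + 1.
// Stage 2: NRZ-to-NRZI conversion, G2(x) = x + 1 (toggle on '1').
module sdi_scrambler (
    input      clk,
    input      d_in,      // serial NRZ data in
    output reg nrzi_out   // scrambled NRZI data out
);
    reg [8:0] sr;                            // scrambler shift register
    wire scrambled = d_in ^ sr[8] ^ sr[3];   // feedback taps for x^9 and x^4
    always @(posedge clk) begin
        sr       <= {sr[7:0], scrambled};
        nrzi_out <= nrzi_out ^ scrambled;    // NRZI: invert on every '1'
    end
endmodule
```

Because the scrambler is self-synchronizing, the receiving descrambler recovers
the data without needing to know the transmitter's shift register state.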
Chapter 4
Project requirements
This chapter describes the various requirements for the processing hardware.
The device using the FPGA video processor is to be used in a medical
environment for displaying live video from endoscopic cameras during surgeries.
The system also has to be able to record the video and store the feed either
locally or over a network, but these functions are handled by a standard x86
system embedded in the device and as such are not the topic of this work.
Chapter 5
Device family selection
The maximum frequency required for any part of the design was estimated to be
150 MHz to 180 MHz for the most demanding components, namely the DDR2 memory
interface and the deinterlacer module.
The selection of the FPGA device family was based on these requirements,
together with a preference for wide availability and good online support.
Chapter 6
Evaluation of commercial IP cores from Altera
in the 9.0 version IP library), we had to abandon using the provided deinterlacer
component from Altera and had to develop our own solution.
At the time of writing this thesis, the compilation runs without any problems
when testing the deinterlacer core. This incident illustrates that an FPGA
development toolchain is rather complex software and should be thoroughly
evaluated before being relied upon in a design.
Regarding the memory access pattern, we needed to read and write sequential
areas of memory and therefore did not need the short memory-side burst lengths
of DDR1 memory, which could be more appropriate for other algorithms such as
real-time image rotation.
We tested the memory controller core by running the "memtest" example included
in the Nios II Embedded Design Suite. The tests passed with no problems and we
therefore decided to use this core.
Chapter 7
Selected system structure
frame buffer component saves memory bandwidth, since it allows buffering of the
half fields only; the final full frame is calculated by the deinterlacer after
the synchronization phase. This also means that the images transferred to the
host x86 system are half fields (for the 1080i interlaced input video format)
and have to be stretched to the original aspect ratio. This solution was found
to be perfectly acceptable, since there is no visible reduction in the quality
of the captured image.
[Figure: simplified video path block diagram — the camera video input feeds the
deinterlacer, whose output is combined in the alpha blender and sent to the
video output.]
The USB interface IC has two channels: one is configured for the RS232 standard
and is used for FPGA system control, the other is a one-way communication link
to the PC for captured still image transfers. The control channel is connected
to a UART component of a controlling SOPC system with the Nios II soft core
processor.
7.5 Deinterlacer
The deinterlacer component is fed video data by the frame buffer component; the
data is deinterlaced (if requested) into a full frame and output to the alpha
blender component. The deinterlacer core is described in a separate chapter.
Chapter 8
Example video processing cores
8.1 Deinterlacer
Deinterlacing is used to convert an interlaced video format to a progressive
one. In an interlaced video stream, each complete frame is transferred as two
half fields, odd and even. The odd field contains the odd picture lines and the
even field contains the even picture lines. By splitting the complete frame
into two half fields, the temporal resolution of the video feed is doubled and
motion appears smoother.
A progressive video format transfers frames as complete units, each frame
containing all (odd and even) lines of the picture. Progressive video does not
have the same temporal resolution as interlaced video of the same bandwidth; on
the other hand, it offers better vertical resolution and therefore a more
detailed image.
The interlaced format is commonly used in broadcast applications and the TV
industry, whereas the progressive format is more common in the PC industry.
Line duplication
The line duplication algorithm simply takes the input line and produces two
lines on the output, each identical to the image line on the deinterlacer
input. This is the simplest deinterlacing algorithm, but also the one with the
lowest output image quality. Since the half-field lines are duplicated, the
output progressive image appears pixelated in the vertical direction. This is
especially visible on sharp, highly contrasting edges in the image.
Line interpolation
The line interpolation algorithm does not replicate the missing lines; instead,
it calculates each missing line from the lines above and below it. This
produces a complete frame from a single half field, with better output image
quality than the line duplication algorithm. The most visible improvement is
that sharp, contrasting edges appear smoother thanks to the interpolated lines.
Weave algorithm
The weave algorithm uses two half fields to produce a progressive output frame.
The method works by merging the odd and even fields directly into one frame.
Compared to the bob algorithms, this method needs storage memory to temporarily
store the half-field data. It also introduces a half-field delay to the
processing chain, since the deinterlacer must wait for a complete field to
produce an output progressive frame. The output quality of this algorithm is
compromised by artifacts on edges in the resulting progressive image: since the
fields used to produce the output originate at different points in time, when
the video feed contains scenes with fast movement the edges appear distorted,
because each field captures the moving object in a different position.
8.1.4 Implementation
if (can_advance == 1)
begin
    x = x + 1;
    if (x == x_size)
    begin
        x = 0;
        y = y + 1;
        if (field == 0)
        begin
            if (y == 0) master_state = 2;
            if ((y >= 1) & (y < (y_size_minus_one)))
                master_state[1:0] = {1'b0, y[0]};
            if (y == (y_size_minus_one)) master_state = 0;
        end
        else
        begin
            if (y == 0) master_state = 2;
            if (y == 1) master_state = 0;
            if ((y >= 2) & (y < y_size))
                master_state[1:0] = {1'b0, ~y[0]};
        end
    end
end
The variables x and y contain the actual position within the video image data,
field is the even/odd field indicator, master_state[1:0] is a variable
indicating which of the actions A, B or C the deinterlacer should perform on
the current line, and can_advance is a signal indicating that the remaining
core components are ready for the next data item.
The deinterlacer_ram_buffer module is an Altera-specific instantiation of an
embedded memory block, forming a RAM to store the image line. The address of
this embedded RAM block is controlled by the scheduler; the
deinterlacer_mem_addr_delay module delays the address signals for line
operation B. Operation B means that the deinterlacer must store the incoming
line to the RAM buffer and at the same time load data from the very same
buffer. It is therefore necessary that the data can be read out of the buffer
before the new image line data are saved into it.
The deinterlacer_line_switch module switches between operations A, B and C as
requested by the scheduler module. Operation A (master_state = 2) means that
data received from the frame buffer component is stored to the RAM buffer and
at the same time routed through the deinterlacer_line_switch to the output
FIFO. Operation B (master_state = 1) means that the incoming data is stored to
the RAM buffer while the previous line data stored in the RAM buffer is read
out and sent to the deinterlacer_line_switch, where the pixel data is averaged
(interpolated) with the current line data and sent to the output FIFO.
Operation C (master_state = 0) does not read the incoming pixel data, but
instead simply outputs the stored line from the RAM buffer to the output FIFO.
The remaining components of the deinterlacer core are mainly support functions
that properly align the individual data and control signals to compensate for
the latency of the respective communicating components.
To relax the requirements on the maximum frequency of the device logic fabric,
the deinterlacer core processes two pixels at a time. This doubles the data bus
width, but allows the operating frequency to be halved while maintaining the
required bus bandwidth.
The deinterlacer core expects the field data in a standard RGB color space,
with every color component having an 8-bit value range (0..255).
The interpolation (vertical averaging) of the neighboring half-field image
lines is done by adding the individual red, green and blue components of the
pixel colors (the two pixels in the RAM buffer from the previous image line and
the two pixels currently being received and stored to the RAM buffer) and then
performing a one-bit right shift, thereby calculating the arithmetic average of
the two values.
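Per color component, the averaging step described above amounts to the
following (a minimal sketch assuming 8-bit components; the signal names are
illustrative, not taken from the thesis code):

```verilog
// Vertical interpolation of one 8-bit color component: add the pixel from
// the buffered previous line and the incoming pixel, then shift right by one.
wire [7:0] pix_buffered, pix_incoming;                        // same column, adjacent lines
wire [8:0] sum = {1'b0, pix_buffered} + {1'b0, pix_incoming}; // 9-bit sum, no overflow
wire [7:0] interpolated = sum[8:1];                           // arithmetic average
```

Widening the sum to 9 bits before the shift avoids losing the carry when both
components are large.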
The alpha value for each pixel can either be fixed for the entire image or
delivered to the blender core as a separate value for each individual pixel,
for example in the unused 8 bits of a 32-bit pixel memory window holding 24-bit
pixel colors. In this work, the blender core has a fixed alpha value for the
entire active picture window. Although a per-pixel alpha channel was initially
considered, to provide a simple way of generating the OSD menu, the fixed-alpha
solution was preferred. The main reason for this decision was that the PC video
feed is used as the source for the OSD menu, and it would be problematic to
transmit the alpha channel through the standard 24-bit DVI interface. Using a
fixed alpha value, the entire range of the pixel values on the DVI interface
can be used for pixel color space coordinates, and OSD generation is achieved
by simply displaying an image on the x86 host system graphics output.
This solution also has its drawbacks, most notably the inability to display a
non-transparent OSD image on top of the live camera video feed. This was
resolved by dedicating a single pixel color from the x86 host system as the
transparent color value. When this color is encountered by the blender core,
the value of the camera video pixel is assigned to the output, regardless of
the alpha value setting. This allows the generation of either a non-transparent
or a semitransparent OSD image on top of the live video feed.
This means that the output video feed has the same parameters (pixel clock,
timing, resolution) as the video feed from the x86 host system. The live video
signal from the camera input is mixed into this video feed using the preceding
frame buffer and deinterlacer components. This allows the system to mix the two
streams with no interruptions in output video timing, since the camera feed is
passed through the frame buffer component and can therefore be matched to the
reference video signal.
The calculation of the output pixel value is divided into separate calculations
for each color component. Each calculation of an output color component is
further divided into pipelined stages, relaxing the timing requirements of the
design compared to an unpipelined implementation.
For the calculation of the output values, the blender core uses equations (8.1)
translated into the integer domain.
8.2.2 Implementation
The output video signal is formed by the output pixel_out[23..0] together with
the timing control signals de_out, hsync_out and vsync_out. The output video
feed uses the same clock as the input reference video feed, i.e. clock_in.
Following is a code walk-through for a single color component (red). The core
starts by registering the input information to reduce the length of the input
path and therefore improve the maximum operating frequency of the core.
always @(posedge clock_in)
begin
    pixel_a <= pixel_a_in;
    pixel_b <= pixel_b_in;
end
Now the core has the input pixel information available in internal registers.
The alpha value for the current video frame is registered during the vertical
blanking interval of the reference video signal; this way, the alpha value is
forced to be the same within each individual video frame. To further improve
the maximum frequency of the core, the alpha values for both video inputs are
immediately calculated (the layerAalpha and (1 - layerAalpha) expressions
described in equations (8.1)).
always @(posedge clock_in)
begin
    if (vsync_in == 1)
    begin
        alpha_a[7:0] = alpha[7:0];
        alpha_b[7:0] = 8'd255 - alpha[7:0];
    end
end
The blender core then continues by calculating the intermediate values for the
expressions listed in (8.1). The core produces intermediate values for the
pixel_a and pixel_b color components. Since there were some problems with the
integer representation of the equations (uneven mapping of the multiplication
results: the component output value for layerXalpha = 255 was 254), the core
checks the alpha value and, if it is maximal, simply outputs the respective
color component. Otherwise, the core performs an integer multiplication of the
pixel color component and the alpha value.
always @(posedge clock_in)
begin
    if (alpha_a == 255)
    begin
        red_a[15:0] = {pixel_a[7:0], 8'b00000000};
        red_b = 0;
    end
    else red_a = pixel_a[7:0] * alpha_a;
    if (alpha_b == 255)
    begin
        red_b[15:0] = {pixel_b[7:0], 8'b00000000};
        red_a = 0;
    end
    else red_b = pixel_b[7:0] * alpha_b;
end
The multiplication results are stored in the pipeline registers red_a_pipe and
red_b_pipe. The core then continues by producing the final pixel output value.
In this step the core also checks for the transparent color as described above
and decides whether to output the pixel value based on the previous
calculations or to output the camera video feed pixel value directly. The color
selected as the transparent color for the video overlay is 0xFF00FF (magenta),
considered very unlikely to appear in the x86 host system video output under
normal conditions.
always @(posedge clock_in)
begin
    if (pc_1 == 24'hFF00FF)
    begin
        red[15:8]   = cam_1[7:0];
        green[15:8] = cam_1[15:8];
        blue[15:8]  = cam_1[23:16];
    end
    else
    begin
        red   = red_a_pipe + red_b_pipe;
        green = green_a_pipe + green_b_pipe;
        blue  = blue_a_pipe + blue_b_pipe;
    end
end
The cam_1 register stores the camera input pixel value for the pixel currently
being processed; the pc_1 register stores the original reference video feed
pixel value. Separate registers are necessary, since at this stage of the
processing pipeline the original pixel_a and pixel_b registers already contain
newer pixels due to the processing latency. This is the last step of the
processing pipeline.
To compensate for the latency introduced by the individual processing stages,
it is necessary to also properly align the output video timing signals to match the
active picture data. This is done by a simple delay stage inside the blender core.
always @(posedge clock_in)
begin
    delay_3 = delay_2;
    delay_2 = delay_1;
    delay_1 = {de_in, hsync_in, vsync_in};
end
The individual bits of the delay_3 register are then output as the final timing
control signals.
At first, the design of the core did not use any pipelining, and the core had
very low performance in terms of the maximum allowable frequency of the
incoming video signal. After introducing the pipelining stages, the core is
capable of handling input video signals with pixel clocks above 150 MHz and
therefore supports the required HD resolutions (the pixel clock of the 1080p
video format is 148.5 MHz).
Double buffering uses two buffers, one for the data producer and one for the
data consumer. Use of the double buffering method is limited to cases where
data production can be controlled so that it works synchronously with data
consumption. This is, for example, the case with graphics cards in the PC
industry. To remove video tearing during the display of rendered scenes, the
GPU renders the scene into a different buffer than the one used to send frame
data to the monitor. The buffers are flipped in the vertical blanking period of
the monitor output timing signal. This removes video tearing, but at the same
time introduces inefficiency into the rendering process, since the GPU has to
wait for the start of the vertical blanking interval before rendering the next
frame (otherwise the GPU has no free buffer to render the scene into).
For situations in which data production cannot be synchronized with data
consumption, double buffering is not optimal, since the method has to drop a
large number of data units. When processing video, this behavior is clearly
visible in scenes with moving objects, where the motion appears choppy and
unnatural.
Triple buffering removes the problems of double buffering by introducing a
third buffer. This ensures that one buffer is always available for either the data
producer or the data consumer.
When considering using triple buffering for the synchronization of two video
streams, each having a different frame rate, this method can provide a simple
frame rate conversion. For example, if the synchronized stream has a lower frame
rate than the stream being synchronized to, the data consumer can reuse the currently processed frame by duplicating it, without any impact on the data producer and the frame storage. The same holds in the opposite scenario, where the data producer stores frames at a higher rate than the data consumer processes them. In this case, frames can be dropped from the output stream, again without any interruption of the data consumption process.

Figure: Triple buffering. The data producer has finished storing a frame into Buffer A and requests a free buffer; writing continues into Buffer C while Buffer B is still being read, and Buffer A becomes idle.
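As an illustration of the rotation logic (a minimal software sketch, not part of the FPGA design; all names here are hypothetical), the triple buffer can be modeled in a few lines of Python:

```python
# Minimal triple-buffering arbiter: the producer and consumer each hold
# one buffer, and the third is always free, so neither side ever waits.
# With unequal rates this naturally duplicates frames (consumer faster)
# or drops them (producer faster).
class TripleBuffer:
    def __init__(self):
        self.writing, self.reading, self.idle = "A", "B", "C"
        self.fresh = False  # idle buffer holds a frame newer than 'reading'

    def producer_done(self):
        # Finished frame stays in the old writing buffer, which becomes
        # idle; the producer takes over the previously idle buffer.
        self.writing, self.idle = self.idle, self.writing
        self.fresh = True

    def consumer_next(self):
        if self.fresh:  # a newer frame is ready: swap it in
            self.reading, self.idle = self.idle, self.reading
            self.fresh = False
        return self.reading  # otherwise re-read (duplicate) current frame

tb = TripleBuffer()
tb.producer_done()          # frame finished in A, producer moves to C
first = tb.consumer_next()  # consumer picks up A
again = tb.consumer_next()  # no new frame yet: A is duplicated
```

With a slower consumer the model duplicates the current frame (the same buffer is re-read); with a faster producer the oldest unconsumed frame is silently overwritten, i.e. dropped — exactly the behavior the simple frame rate conversion relies on.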
The frame buffer was implemented as an SOPC Builder generated component. The module uses several subcomponents connected together by the Avalon Interface developed by Altera [25]. The main parts of the frame buffer module are the frame writer component, the frame reader component, the USB reader component and the necessary DDR2 memory controller core. By using the Avalon interface, the system could take advantage of the Altera provided High Performance DDR2 Memory Controller core.
The frame writer was based on an example template of an Avalon Interface Memory Mapped Master with burst writes [26]. This is a template for a master component located in an Avalon interface SOPC system. The component can be thought of as a DMA (Direct Memory Access) engine with the source data not residing in the target address space. It provides a bridging function between the Avalon subsystem and the user logic design. The original component includes a single clock FIFO to provide the input from the user design into the Avalon subsystem. For the purposes of this project, this single clock FIFO was replaced with a dual clock version of the same, thereby providing the clock crossing function required to move the video data from the camera video input clock domain to the internal Avalon subsystem clock domain. There were also some other modifications to the template, mainly regarding the start of the transfer. The bursting capability enables the DMA engine to transfer large portions of the video data, which greatly improves the efficiency of DDR2 memory access. The bursting capability is discussed further in the implementation section.
The frame reader is also based on an Altera template, namely the Avalon Interface Memory Mapped Master with burst reads. This component works in the same way as the frame writer, but it instead reads data from the DDR2 memory address space and transfers it through a dual clock FIFO into the user design. The user design is in this case the deinterlacer module.
The USB reader component is based on the same component as the frame
reader, but this time the output is connected to the external USB to FIFO bridge
(FT2232H [21]) from FTDI. The component was modified to adapt to the timing
required by the external USB bridge chip.
The last part of the frame buffer component is the Altera DDR2 High Perfor-
mance Memory Controller core. This component provides the means to access the
external DDR2 memory chip by mapping it into the Avalon subsystem memory
space.
This entire component is controlled by a second SOPC builder subsystem lo-
cated inside the FPGA. This subsystem provides the control functions for the en-
tire FPGA design and the peripheral devices, including the communication link
of the device to the PC using a second data channel of the FT2232H bridge, con-
trol of the timing of frame reading and writing of the frame buffer components,
input video format detection and other support functions of the FPGA system.
This subsystem is not described here due to the reasons described in the Preface
to this work.
8.3.2 Implementation

As was said above, the component was implemented as an SOPC Builder subsystem. The system is clocked by an 81 MHz clock signal generated by a PLL component from an input 27 MHz clock signal. To reduce the maximum operating frequency requirements, the design uses a 64-bit width for the interconnecting Avalon bus fabric.
The local side of the DDR2 memory controller core connects to the Avalon bus fabric; the memory side connects to the external DDR2 memory chip. The DDR2 memory controller core is configured to run the external 16-bit DDR2 memory chip at 162 MHz, twice the frequency of the local side bus of the frame buffer subsystem. Since both clocks are generated by a single PLL from the same reference clock source, there is no need for a clock crossing bridge when moving data to and from the DDR2 memory space. The memory controller is configured to use a burst length of 8 beats on the memory side. This translates to the 128 bits of a local side data transaction being transferred in two clock periods of the local side bus, since (in bytes):

8 beats × 2 B/beat = 16 B = 2 × 8 B (two 64-bit local words)
This provides a highly efficient data path to transfer the video frame data. The
DDR2 memory controller core is also configured to allow for a local burst size of
64 words. This means that the core can be exclusively accessed by one of the three
masters on the bus (frame reader, frame writer, USB link) for a maximum of 64
data word transfers, further improving the memory access efficiency.
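The burst arithmetic above can be checked numerically; the following sketch uses the figures from this configuration:

```python
# Memory side: burst of 8 beats on a 16-bit DDR2 device.
beats, mem_width_bits = 8, 16
burst_bytes = beats * mem_width_bits // 8            # 8 * 2 B = 16 B

# Local (Avalon) side: 64-bit words.
local_width_bits = 64
local_words = burst_bytes * 8 // local_width_bits    # 16 B / 8 B = 2 words

# DDR moves 2 beats per memory clock, and the local clock runs at half
# the memory clock, so one local clock period covers 4 memory beats.
beats_per_local_clock = 2 * 2
local_clocks = beats // beats_per_local_clock        # 2 local clock periods
```

One memory-side burst thus maps exactly onto two 64-bit local words delivered in two local clock periods, which is why the data path incurs no idle cycles within a burst.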
When testing the frame buffer subsystem on a custom development board, the DDR2 memory channel was 32 physical bits wide (there were two physical x16 DDR2 chips, each storing one half of the 32-bit word) and the frame buffer subsystem did not use any burst transfers. With this wide channel configuration, the system worked fine. However, after trying to optimize the design to use only a single x16 DDR2 memory device, it was found that the single x16 DDR2 device is not capable of providing the required bandwidth with this setup. This was clearly visible in the quality of the resulting output video feed, ranging from mild occasional tearing to a complete breakdown of the output video structure a few seconds after system startup. After some investigation with the SignalTap II logic analyzer, it was found that the DDR2 memory was being accessed in a highly inefficient manner.
The memory address space is shared by three Avalon master components: the frame reader, the frame writer and the USB link. Since master access to the slave memory address space is scheduled by a round robin arbiter, this resulted in a high number of accesses for very small data items. The memory controller had to continuously switch either the currently accessed row and column or switch between read and write operations. Both these events cause a high performance penalty by introducing wait cycles between the individual operations.
For this reason, the Avalon example templates then in use were changed to their bursting capable versions, and the memory controller core was changed to its burst capable version. This resulted in a significant performance improvement of the system, since the memory core was now accessed with very large transactions, diminishing the performance penalties of the "small and frequent" access pattern.
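The effect of burst length on efficiency can be approximated with a simple first-order model (illustrative only; the overhead figure below is a hypothetical placeholder, not a measured value for this controller):

```python
# Each memory access pays a fixed overhead of wait cycles (row/column
# switching, read/write turnaround) before its data beats transfer.
# Longer bursts amortize that overhead over more useful cycles.
def efficiency(burst_beats, overhead_cycles=10):
    return burst_beats / (burst_beats + overhead_cycles)

single = efficiency(1)   # "small and frequent" pattern: ~9% useful cycles
bursty = efficiency(64)  # 64-word bursts: ~86% useful cycles
```

Even this crude model shows an order-of-magnitude gap between single-word and long-burst access, consistent with the observed bandwidth collapse on the single x16 device.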
The read master template used as a base for the custom project components was modified by adding a dual clock FIFO in place of the originally used single clock one. Because the Verilog module source instantiates the FIFO as a standalone entity and the interfaces of the dual and single clock versions are relatively similar, the modification was fairly straightforward.
the_master_to_video_fifo the_master_to_user_fifo (
    .aclr    (control_reset),
    .data    (master_read_data),
    .rdclk   (fifo_read_clock_input),
    .rdreq   (fifo_read_rdreq_input),
    .wrclk   (clk),
    .wrreq   (master_read_data_valid),
    .q       (fifo_read_data_output),
    .rdempty (fifo_read_rdempty_output),
    .wrusedw (fifo_used),
    .wrfull  ()
);
The other significant modification to the reader template was the inclusion
of the FT2232H FIFO interface for the USB link component. This was needed to
properly present the parallel data to the external chip.
The frame writer component works in the same way as the frame reader component, with the data direction reversed. The input video data from the camera is fed to the input dual clock FIFO using the data enable signal of the video timing information. By using this signal as the write enable signal for the FIFO, only the active picture data is stored in the FIFO component.
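The DE-gated FIFO write can likewise be modeled in software (a sketch; signal and function names here are illustrative, not from the thesis code):

```python
from collections import deque

def store_active(samples):
    """Write only active-picture samples to the FIFO: the data enable
    (DE) signal doubles as the FIFO write enable, so blanking-interval
    samples are never enqueued."""
    fifo = deque()
    for de, pixel in samples:
        if de:
            fifo.append(pixel)
    return fifo

# an active region surrounded by blanking samples
stream = [(0, None), (1, 0x10), (1, 0x20), (0, None), (1, 0x30)]
fifo = store_active(stream)  # holds pixels 0x10, 0x20, 0x30 in order
```

The consequence is that the FIFO fill level directly tracks the amount of active picture data, which simplifies the start-of-burst decision on the Avalon side.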
Chapter 9
FPGA design flow
data word is discarded. This may hold true for other components of the Avalon
fabric as well.
The memory access pattern is an important design decision as well. As de-
scribed in the previous chapter, the differences in memory bandwidth between
bursting and non-bursting transfers can be huge. There is plenty of information
available on how to optimize the access patterns for a given task [27][28].
Chapter 10
Resulting hardware
After the evaluation of the design on a development board, a final hardware board was developed for the FPGA video processor. Based on the design requirements, a target device was selected to fit the design. The device selected from the Cyclone IV family was the EP4CE15F17C7N, with the option to migrate to the older Cyclone III generation device EP3C16F256C7N. These devices have very similar pinouts and can therefore be interchanged with no major problems.
Figure 10.1: Final hardware board containing the FPGA video processor
The entire system resource usage (after optimizations done by the synthesis
and fitter compilation phases) is 13565 logic elements (88% usage), 313072 mem-
ory bits (61% of the device resources) and 151 pins (91%). An illustration of the
final pin usage can be seen in the appendix section to this work. Of this resource
usage the deinterlacer block uses 501 LEs, alpha blender uses 228 and the frame
buffer component uses 8269 LEs. The high resource usage of the frame buffer is
mostly caused by the DDR2 memory controller core (6304 LEs) and Avalon bus
arbitration (776 LEs). The reader and writer components use about 400 logic ele-
ments each.
The DDR2 x16 memory is connected to banks 3 and 4, the other I/O pins are
mostly used to connect to the parallel video data buses of the two video input ICs
and one video output chip.
Chapter 11
Conclusion
This work presented an overview of the development process of a hardware device using a Field Programmable Gate Array and provided example IP components for video processing to illustrate the possibilities and approaches used in such development.
The capabilities of the selected low cost FPGA family from Altera were found to satisfy the processing power requirements with no major problems. The flexibility of the FPGA architecture proved very useful, since there were many minor modifications to the system behavior throughout the development of the project. Although the per-chip price of an FPGA is still higher than that of an ASIC, for small production series this difference is balanced by the flexibility and the ability to customize the design to project specific requirements.
The overall development of the system took about one year; as of the time of writing of this thesis, the system is being prepared for production. The latency of the displayed live video feed was found to be well within the required range, providing the final users with a realistic response of the system to the scene being displayed in the video feed.
Apart from the cores developed by the author, the design uses many ready to use cores provided by Altera. The experience with their use was mostly positive, with the exception of some of the Video and Image Processing Suite components, which failed even to compile successfully. This illustrates the need to properly evaluate any IP being considered for use before the actual development.
Of course, the design could be done better. In retrospect, the design could have been made more effective in terms of simplicity and overall readability. For example, the frame buffer component is controlled by an SOPC controller with a Nios II processor core. All the control signals and data buses are routed as connections in the top level schematic file from the controller module to the frame buffer module. Together with the frequent modifications to the design, this created a rather unreadable and messy mesh of interconnecting wires. Should the system be developed now, all of these control signals would be replaced by setting the core parameters using registers accessible via the Avalon interface. This would result in a single, clean and readable entity.
FPGA design is a rather attractive field; since many computing and embedded projects are slowly transitioning from single core to parallel architectures, programmable logic is a viable field of interest. With the announced production level quality partial reconfiguration [29], the devices may soon be able (in a usable way) to instantiate portions of a hardware design on the go, practically providing a hardware on demand capability. This may well be the case for future architectures of digital systems, where some fixed silicon sequential core loads hardware blocks as required by an operating system. There may be interesting research topics in this area considering the possible implications of the discovery of the memristor [7], which is a circuit element capable of acting both as a storage element and as logic. The implications of merging high density storage with programmable logic resources would be immense.
Should the memristor discovery indeed deliver the technology for the next generation of programmable logic, such devices may be capable of further reducing the performance gap between programmable and fixed silicon, and may start to appear in systems traditionally relying on an ASIC solution.
Bibliography
[1] IBSmm Engineering Czech Design Center official website [online]. IBSmm CDC, [cited 2011-04-17]. URL: http://www.ibsmm.com.
[2] Wikipedia: Moore's law [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Moore%27s_law.
[3] Wikipedia: Programmable logic devices [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Programmable_logic_devices.
[4] Wikipedia: Field-programmable gate array [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Field-programmable_gate_array.
[5] Altera Cyclone IV Device Handbook [online]. Altera Corp., [cited 2011-04-26]. URL: http://www.altera.com/literature/hb/cyclone-iv/cyiv-5v1.pdf.
[6] Altera Stratix V Device Family Overview [online]. Altera Corp., [cited 2011-04-17]. URL: http://www.altera.com/literature/hb/stratix-v/stx5_51001.pdf.
[7] Finding the Missing Memristor, R. Stanley Williams [online]. Lecture recording, YouTube, [cited 2011-04-17]. URL: http://www.youtube.com/watch?v=bKGhvKyjgLY.
[8] Achronix Speedster22i product overview [online]. Achronix Semiconductor Corp., [cited 2011-04-17]. URL: http://www.achronix.com/products/speedster22i.html.
[9] Wikipedia: Memristor [online]. Wikipedia, [cited 2011-04-17]. URL: http://en.wikipedia.org/wiki/Memristor.
[10] End of the CPU? HP demos configurable memristor [online]. R Colin Johnson, EE Times, [cited 2011-04-17]. URL: http://www.eetimes.com/electronics-news/4088557/End-of-the-CPU-HP-demos-configurable-memristor.
[11] Intel Introduces Configurable Intel Atom-based Processor [online]. Press release, Intel Corporation, [cited 2011-04-18]. URL: http://newsroom.intel.com/docs/DOC-1512.
[12] Altera Video and Image Processing Suite User Guide [online]. Altera Corp., [cited 2011-04-26]. URL: http://www.altera.com/literature/ug/ug_vip.pdf.
[17] Society of Motion Picture and Television Engineers official website [online]. SMPTE, [cited 2011-04-26]. URL: http://www.smpte.org/home/.
[18] Wikipedia: Serial Digital Interface [online]. Wikipedia, [cited 2011-04-26]. URL: http://en.wikipedia.org/wiki/Serial_digital_interface.
[19] Altera Corporation official website [online]. Altera Corp., [cited 2011-04-07]. URL: http://www.altera.com.
[24] FPGA designer curriculum [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/education/training/curriculum/fpga/trn-fpga.html.
[27] The Efficiency of the DDR & DDR2 SDRAM Controller Compiler [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/literature/wp/wp_ddr_sdram_efficiency.pdf.
[28] Altera Forums [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.alteraforum.com/.
[29] Cyclone V Device Family Advance Information Brief [online]. Altera Corp., [cited 2011-05-29]. URL: http://www.altera.com/literature/hb/cyclone-v/cyv_51001.pdf.
Appendix A
Pin placement
The pin placement illustration for the target device EP4CE15F17C7N. This place-
ment is also compatible with the alternative EP3C16F256C7N.
Appendix B
Device floorplan
The resulting floorplan for the target device EP4CE15F17C7N showing a rela-
tively high resource usage of the chip by the design.
Appendix C
Slack histogram
This chart shows the TimeQuest timing analysis slack distribution histogram for
the final design.
Appendix D