
The official assignment of your thesis will be placed here.
This assignment is signed by the dean and the head of the department; you must pick it up at the study office of the Department of Computer Science and Engineering at Karlovo náměstí. One submitted copy of the thesis will contain the original of this assignment (the original stays at the department after the defence); the other copy will contain an unauthenticated copy of this document in the same place (this copy will be returned to you after the defence).


Czech Technical University in Prague


Faculty of Electrical Engineering
Department of Computer Science and Engineering

Master's Thesis

Acceleration of 10GbE Network Traffic


Bc. Michael Rohrbacher

Supervisor: Ing. Jan Kubr

Study Programme: Electrical Engineering and Information Technology


Field of Study: Computer Science and Engineering
December 31, 2012


Acknowledgements
I would like to thank my family and my friends for their support during the time I was writing this thesis. I would also like to thank my supervisor, Ing. Jan Kubr, for the opportunity to work on this very interesting topic, and Moris Bangoura and Radim Roka for answering the questions I had about their theses.


Declaration
I hereby declare that I have completed this thesis independently and that I have listed all
the literature and publications used.
I have no objection to the usage of this work in compliance with §60 of Act No. 121/2000 Sb. (the Copyright Act) and with the rights connected with the Copyright Act, including changes in the act.

In Klatovy on Dec 31, 2012

.............................................................


Abstract
This thesis explores the capabilities of the WANic 56512, a network card with a packet processor. It covers all the steps that are required to get this card working in our laboratory. Part of the thesis discusses whether the WANic 56512 is a suitable candidate for packet generating, receiving and switching. Various benchmarks were performed to obtain trustworthy results supporting the final conclusion.
The output of this thesis is an installation guide which speeds up the deployment time for future use. A set of tests that can be used to measure the performance of the card is also presented. At the end of the thesis there is a comparison between the WANic 56512 and the prototypes made by Radim Roka and Moris Bangoura.


Contents

1 Introduction
  1.1 Motivation and goal
  1.2 Structure of the thesis

2 Related works
  2.1 Performance evaluation of GNU/Linux network bridge
  2.2 10GbE Routing on PC with GNU/Linux

3 Theoretical description of WANic 56512
  3.1 General overview
    3.1.1 The MIPS Architecture
    3.1.2 Comparison between the MIPS and x86 architecture
  3.2 Hardware Acceleration Units
  3.3 Packet Flow
  3.4 Simple Executive

4 Installation of WANic 56512
  4.1 Description/Specification of the host system
  4.2 Installation procedures
    4.2.1 Installation procedure without buying the SDK
      4.2.1.1 Diagnostic mode's kernel
      4.2.1.2 cnusers SDK
    4.2.2 Installation procedure with buying the SDK
      4.2.2.1 PCI console
      4.2.2.2 NIC mode
      4.2.2.3 Ethernet PCI mode

5 Benchmarks for WANic 56512
  5.1 RFC 2544
    5.1.1 Throughput
    5.1.2 Frame loss rate
  5.2 Benchmarks for Linux environment
    5.2.1 iperf
    5.2.2 netperf
    5.2.3 curl-loader
    5.2.4 pktgen
    5.2.5 bridge-utils
  5.3 Benchmarks for the Simple Executive environment
    5.3.1 traffic-gen
    5.3.2 CrossThru

6 Benchmarking
  6.1 iperf
  6.2 netperf
  6.3 curl-loader
  6.4 pktgen
  6.5 bridge-utils
  6.6 traffic-gen
  6.7 CrossThru

7 Analysis of the benchmarks
  7.1 iperf
  7.2 netperf
  7.3 pktgen
  7.4 bridge-utils
  7.5 traffic-gen
  7.6 CrossThru
  7.7 Graphs

8 Conclusion
  8.1 Future work

A Scripts
  A.1 iperf
  A.2 netperf
  A.3 pktgen
  A.4 traffic-gen

B CD content

List of Figures

3.1 Block diagram of WANic 56512. Source: [8]
3.2 Block diagram of OCTEON CN5650. Source: [3]
3.3 MIPS architecture, pipelined. Source: [17]
3.4 Packet input. Source: [11]
3.5 SSO and core processing. Source: [11]
3.6 Packet output. Source: [11]
4.1 Topology of the lab.
4.2 The needed DIP switch. Source: [5]
5.1 CrossThru Flowchart. Source: [2]
7.1 TX and RX test - TCP - iperf
7.2 TX test - UDP - iperf
7.3 TCP_STREAM and TCP_SENDFILE test - netperf
7.4 UDP_STREAM test - netperf
7.5 TX test - pktgen
7.6 RFC 2544 - throughput test - brctl
7.7 RFC 2544 - frame loss test - brctl
7.8 RFC 2544 - throughput test - traffic-gen
7.9 RFC 2544 - throughput test - CrossThru FF+L2
7.10 RFC 2544 - frame loss test - CrossThru FF+L2 basic
7.11 RFC 2544 - frame loss test - CrossThru FF+L2 optimized

Chapter 1

Introduction
1.1 Motivation and goal

Network accelerators are an overlooked group in current research. I was able to find only one scientific article that focuses on the usage of the Cavium OCTEON processor, and that article was about IPsec [12]. Therefore, I decided to write my thesis about network accelerators in 10 GbE networks in general.
These accelerators could offer an interesting trade-off in comparison with hardware switches. For a higher price, the customer gets a multifunctional device that can be used not only for switching but also for packet generation, routing, firewalling, protocol analysis and many other tasks. On the other hand, the question is whether we need those pricey accelerators for these tasks, or whether we can achieve reasonable results with a much cheaper, common PC.
Since this area is poorly documented, new challenges arise. We do not know whether we can use the same software we use in our PCs and switches, whether our network infrastructure will need to be upgraded, or whether these accelerators are really worth buying. My thesis should help to answer such questions.
The goal of this thesis is to understand how network accelerators work and how they can improve the performance of 10 GbE networks. This part focuses on the WANic 56512 network card with a Cavium OCTEON packet processor. Another part of the goal is to document all the steps that allow the user to install the WANic 56512. The documentation provided by the manufacturer, GE Intelligent Platforms, was insufficient and contained many mistakes. Therefore, I decided to write my own installation guide that corrects those mistakes, adds additional steps and procedures, and puts all the information in one place.
Another important task is to research open-source benchmarks that can test the attributes of 10 GbE networks and to suggest which benchmarks could be used with the WANic 56512. The areas of my focus are packet generation, receiving and switching.
The last goal is a comparison of my approach with the approaches proposed by Radim Roka and Moris Bangoura in their theses [16], [9]; in other words, whether a network accelerator brings any advantages over a regular PC with an optimized kernel and network drivers, or over a regular PC using graphics cards.


1.2 Structure of the thesis

This thesis has eight chapters:

Chapter 2 - summarizes the related works by Radim Roka and Moris Bangoura. It also highlights the difference between their approach to packet switching and mine.
Chapter 3 - describes the theoretical background of the WANic 56512 card: its architecture, the related acceleration hardware, the packet flow, etc.
Chapter 4 - explains the installation process of the WANic 56512 card and provides the steps necessary for full operating capability.
Chapter 5 - discusses the possible benchmarks for the WANic 56512 card.
Chapter 6 - shows the performed tests and the configuration used.
Chapter 7 - analyzes the results of the tests performed by Radim Roka, Moris Bangoura and Michael Rohrbacher.
Chapter 8 - summarizes the thesis and all the results. Possible future work is also suggested.

Chapter 2

Related works
In this chapter I will briefly summarize two master's theses written by my colleagues Radim Roka and Moris Bangoura. They worked on a similar topic but with different hardware and a different approach.

2.1 Performance evaluation of GNU/Linux network bridge

This master's thesis was written by Radim Roka in 2011 and deals with the problem of creating a network bridge in 10 GbE networks using the Linux operating system. The output of the thesis is a comparison between the device designed by the author and a hardware switch.
The author first covers the theory of benchmarking network devices, defines basic terms and describes the necessary steps for different benchmarks. He also identifies the different types of delays that have to be taken into account in 10 GbE networks.
The main part of the thesis is about finding the appropriate hardware (CPU, motherboard, chipset, bus, NIC, etc.) for switching traffic in 10 GbE networks, as well as finding the right tools for switching and packet generation. brctl is used for switching and pktgen for packet generation, and the final hardware configuration is the following:

2x motherboard: Supermicro X8DAH+-F,
CPUs: 2x quad-core Xeon 5606, 2.13 GHz and 2x quad-core Xeon 5620, 2.4 GHz with HT technology,
2x 3x2GB DDR3-1066 RAM (for each computer),
10GbE NIC: dual-port 10GbE Intel 82599 controller,
3x 1GbE NIC with 6 ports and Intel 82576 controller.


This configuration was tested against the following hardware switches:

H3C S5800 network switch: 24x 1GbE and 4x 10GbE ports,
Juniper EX3200 switch: 24x 1GbE and 2x 10GbE ports.

The comparison covers several operating systems, namely GNU/Linux Debian, GNU/Linux Bifrost, and FreeBSD. First of all, the author runs several benchmarks without any optimization on one, two, and four output devices. Further on, the author proposes some optimizations, such as:

turning off the flow control,
increasing the ring buffers,
assigning interrupts and setting up the SMP affinity,
setting up the receive/transmit queues.

The performance of the proposed solution is rather poor. The system is capable of generating 64B packets at almost the wire speed, and at the wire speed for packets larger than 128B. But the switching throughput for 64B packets is about 4 Mpps, whereas the line rate is about 14.8 Mpps.
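For reference, the theoretical 10GbE line rate for 64B frames follows from the fixed per-frame overhead on the wire: each frame occupies 64 B plus 8 B of preamble and a 12 B inter-frame gap, so the maximum rate is 10 x 10^9 / ((64 + 20) x 8) ≈ 14.88 million frames per second.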

2.2 10GbE Routing on PC with GNU/Linux

This master's thesis was written by Moris Bangoura in 2012 and builds on the master's thesis written by Radim Roka. The main difference is that the computing is done on GPGPU cores, and the author was able to perform not only switching but also routing and firewalling.
First, the background theory is described, including descriptions of the PacketShader and netmap frameworks. The author also describes the GPGPU architecture. He then deals with finding the right architecture, hardware and software for switching, routing and firewalling. The final configuration is the following:

2x motherboard: Supermicro X8DAH+-F,
CPUs: 2x quad-core Xeon 5620, 2.4 GHz with HT technology,
GPGPUs: 2x NVIDIA GTX 580, 3 GB RAM,
2x 3x2GB DDR3-1066 RAM (for each computer),
10GbE NIC: 2x dual-port 10GbE Intel 82599 controller.

The author modifies the PacketShader I/O module and adds TX IP header checksum computation. He also creates an application based on the PacketShader framework for GPGPU firewalling.


The results of the proposed prototype were quite positive. The system is able to achieve the line-rate speed for:

transmitting from 64B packets upwards,
receiving from 256B packets upwards,
routing from 512B packets upwards,
firewalling: here the results depend on the number of ACL rules in the routing table. For 0-256 rules, the system is able to operate almost at the wire speed. With more rules, the performance drops rapidly.

The main difference between those theses and my thesis is that I will use a specialized piece of hardware for packet processing which includes hardware acceleration units for better performance. It is a completely different architecture and it uses software written specifically for these tasks.


Chapter 3

Theoretical description of WANic 56512

3.1 General overview

The GE Intelligent Platforms WANic 56512 is an intelligent, high-performance packet processor which contains [8]:

a Cavium OCTEON™ Plus 12-core 750 MHz CN5650 processor,
4 GB of high-speed DDR2 SDRAM via VLP Mini-RDIMMs,
32 MB of DDR SDRAM persistent memory,
a 2 GB USB flash disk,
2x 10 Gb Ethernet via SR/LR SFP+ transceivers,
a 4-lane PCI-Express host interface.

For a better understanding of WANic's equipment and its interconnection, the block diagram is included in figure 3.1.
A packet processor is a special type of processor built explicitly to deal with issues which arise in computer networks, such as monitoring, network management, security, etc. These devices perform data inspection, identification, extraction, and all other kinds of data manipulation which can later be used for load balancing, traffic shaping and routing. Their main advantage over general-purpose processors is the presence of software and hardware developed specifically for the packet flow. This assures the best possible line rates.
The WANic 56512 is equipped with the OCTEON CN5650 packet processor, developed by Cavium Networks. This processor is based on the Microprocessor without Interlocked Pipeline Stages (MIPS) architecture, which I will describe in more detail later on. The key feature of Cavium's processors is the presence of Hardware Acceleration Units. They have a huge effect on:


Figure 3.1: Block diagram of WANic 56512. Source: [8]

packet I/O processing, Quality of Service (QoS) and TCP,
security: IPsec, SSL and 3G/UMB/LTE,
compression/decompression.

Another key feature, very desirable nowadays, is the power consumption. According to the product brief [4], Cavium Networks claims that the power consumption of the CN5650 chip is only 10 to 30 W. Other considerable features include the dedicated DMA engines for each hardware unit, the high-speed interconnects between the hardware units, and the ability to group the cores as desired.
The communication between the host computer and the WANic 56512 card is provided via the PCI-Express bus. The whole control and management of the card is done either via the serial console or via the PCI console. This process will be further described later on.
All described units can be found in the block diagram in figure 3.2.

3.1.1 The MIPS Architecture

Since the OCTEON chip is based on the MIPS architecture, I will describe in more detail why the chip is built on this particular architecture. The text in this subsection is based on the text of my bachelor thesis, Measurement of throughput of Ethernet Cards on Telum NPA-5854 device [15].
The MIPS architecture is a typical example of a Reduced Instruction Set Computer (RISC) Instruction Set Architecture (ISA), based on the principle of main registers. MIPS64 was first introduced in 1991 and it was the first 64-bit architecture in the world. MIPS uses the register-register approach, sometimes also called the load-store architecture.


Figure 3.2: Block diagram of OCTEON CN5650. Source: [3]

Advantages of using registers:

Registers are faster than memory.
Access to registers can be random.
Fewer accesses to memory are needed.
Registers can store intermediate results, local variables and parameters.

Disadvantages of using registers:

The number of registers is limited.
A more complex compiler is needed.
Longer context switches.
Registers cannot store composite data structures.

Characteristics of RISC:

Fixed-length instructions (32-bit), which results in simpler decoding.
Three-address architecture: all three registers need to be specified; for example, add $s0, $s1, $s2 means s0 = s1 + s2.
A large number of registers to use (32).


All instructions take the same processing time.
Very fast instruction processing.
Pipeline processing is easy to implement.

Figure 3.3: MIPS architecture, pipelined. Source: [17]

3.1.2 Comparison between the MIPS and x86 architecture

The main difference between these two architectures is that MIPS is an example of RISC, whereas the x86 architecture is an example of a Complex Instruction Set Computer (CISC) ISA. Other differences:

MIPS has aligned data, x86 does not.
MIPS has 32 registers, x86 has only 8.
MIPS's return address is always in register 31, x86 uses the stack.
MIPS is a load-store architecture (simpler hardware, easier to pipeline, higher performance), x86 is a memory-register architecture (fewer instructions in the program, resulting in smaller code, but more complicated hardware and more instructions that need to be implemented).
MIPS has fixed-length instructions, x86 has variable-length instructions.

3.2 Hardware Acceleration Units

Each OCTEON chip contains several hardware acceleration units that offload work and free the cores. The units can be divided into:

Packet-management accelerators: The packet traffic can be enormous in busy networks. Therefore, it is desirable to offload the time-consuming packet processing from the cores. The packet-management accelerators are responsible for packet receiving, transmitting, buffering, QoS and the packet flow. The packet data buffers are automatically created and freed. In the case of TCP and UDP, the packet headers are automatically checked on receive and the checksum is automatically calculated on transmit. TCP retransmission is also implemented in the timer unit. Packet ordering and scheduling is managed by its own unit.

Security accelerators: These accelerators are responsible for generating random numbers and for accelerating security algorithms and related operations, such as MD5, SHA, 3DES, AES, RC4, KASUMI, RSA and TKIP.

Application accelerators: The CN5650 chip has units that provide acceleration for DEFLATE compression/decompression, CRC checksums for ZLIB and GZIP, and acceleration for RAID 5 and RAID 6.

3.3 Packet Flow

Before explaining the packet flow, it is convenient to describe the units that are involved in this process:

SSO unit - The Schedule/Synchronization and Order Unit manages packet scheduling and ordering.
PIP unit - The Packet Input Processor Unit works with the IPD to manage the packet input.
IPD unit - The Input Packet Data Unit works with the PIP to manage the packet input.
PKO unit - The Packet Output Unit manages the packet output.
FPA unit - The Free Pool Allocator Unit manages pools of free buffers, including Packet Data buffers.

Understanding the packet flow is crucial for understanding how packet processing is performed inside the OCTEON chip, and also for writing new software applications. The whole process can be divided into three main sections.

Packet input: In this phase the packet is received and checked for errors by the RX port. Then the packet is passed to the IPD unit, where the packet data is shared with the PIP unit. The PIP unit is responsible for packet parsing. The IPD unit stores the packet data in a Packet Data Buffer (allocated from the FPA unit) in L2/DRAM; DMA is used for this process. A pointer to the appropriate QoS queue in the SSO unit is also created. See figure 3.4 for more details.


Figure 3.4: Packet input. Source: [11]

SSO and core processing: The SSO unit schedules the work to be done based on the QoS priority, ingress order and current locks. The cores then process the packet data, which is read and written in L2/DRAM. After this processing, each core sends a pointer to the packet data buffer and the data offset to the appropriate Packet Output Queue in the PKO unit. The output port and packet priority are specified. See figure 3.5 for more details.

Packet output: In this phase the PKO unit copies the data from the buffer described above into its own memory and adds the TCP or UDP checksums if desired. Then the PKO unit sends the data from its memory to the output port and the packet is transmitted by the TX port. See figure 3.6 for more details.

Many of the hardware acceleration units described in the previous section play a significant role in the packet flow. This eliminates bottlenecks, because the cores can work on packet processing in parallel without having to classify and prioritize the packets themselves.


Figure 3.5: SSO and core processing. Source: [11]

3.4 Simple Executive

The Simple Executive is an Application Programming Interface (API) which provides a Hardware Abstraction Layer (HAL) to the hardware units included on the OCTEON chip. The functions provided by the Simple Executive API can be used to develop a standalone or user-mode Simple Executive application. User-mode means that the application is run from the Linux operating system. The differences between these two run-time modes have a huge influence on the overall performance of the application.

Standalone mode: When running an application in standalone mode, the best possible performance should be assured. There are no context switches and the whole memory is mapped for fast access. There is also a great opportunity for scaling, because all the cores can run the same application.


Figure 3.6: Packet output. Source: [11]

User-mode: On the other hand, when an application is run in user-mode, we have to take into account cache and TLB misses and higher traffic on the buses. At least one core is reserved for Linux, and the memory also has to be divided between the application and the Linux kernel.

In this chapter I tried to point out the most important pieces of hardware present on the card. In section 3.1.1 I showed why the designers of the OCTEON chip chose the MIPS architecture over x86: this architecture is a better candidate for the parallel processing that is highly desirable for the best packet-processing performance. Since handling packets, especially small ones, can be very CPU-intensive, the designers added hardware accelerators to the card to ease the CPU load. I also described the main differences between standalone and user-mode for applications developed for the card.

Chapter 4

Installation of WANic 56512


The installation of the card was the most time-consuming and difficult part of my thesis. Therefore, I decided to include the procedures and approaches I tried as a regular chapter.

4.1 Description/Specification of the host system

The WANic 56512 is inserted into the following host system:

Intel® Core™2 Duo CPU E8200 @ 2.66 GHz,
2x Corsair 2GB DDR2 RAM (CM2X2048-6400C5, 800 MHz),
Gigabyte GA-EP45-DS4 motherboard,
nVidia GeForce 9600 GT,
Linux kernel 2.6.32-40-generic SMP i686, Ubuntu 10.04 LTS,
Seasonic SS-500GB Active PFC F3 power supply.

The first problem was to find a power supply powerful enough to run both the computer and the card. After trying several power supplies I found that the power supply needs to be rated at least 500 W.
The next issue was cooling the card. The card is designed for an air-cooled chassis environment which we do not have. The first attempts with the card resulted in permanent damage due to overheating; nevertheless, the overheating was most likely caused by a hardware failure. After the card was replaced, we used an optional fan that provides more airflow and keeps the temperature below 105 °C, as required by the manufacturer.
The card is connected to the host system via the PCI-Express x8 bus and to the laboratory network via the H3C S5800 switch. The network topology also includes the generator, which is likewise connected to the switch. The topology is shown in figure 4.1.


Figure 4.1: Topology of the lab.

4.2 Installation procedures

The installation process of the card was much more complicated than I expected. There are basically two approaches to installing the card, depending on whether we buy the SDK or not.

4.2.1 Installation procedure without buying the SDK

My first attempt was to simply connect the WANic 56512 to the host system and see whether the card would be recognized by the operating system. Ubuntu successfully recognized the card; the lspci command showed: 02:00.0 MIPS: Cavium Networks Octeon CN57XX Network Processor (CN54XX/CN55XX/CN56XX) (rev 09). But the interfaces on the card were not recognized and did not show up in ifconfig.
4.2.1.1 Diagnostic mode's kernel

The next move was to use the diagnostic mode. This mode contains a Linux kernel image configured by the manufacturer and stored in the card's memory. To run this image, DIP switch 4 needs to be changed to the ON position. The location of the switch is shown in figure 4.2. Next, we need to establish a connection between the card and the host system to see the output. The only option without buying the SDK is a serial console. We already had the needed 20-pin to RS-232 connector from my bachelor's thesis; otherwise, we would have to buy the serial adapter kit.
A program called minicom is used to connect to the serial output of the card. The parameters of the connection are 115200 8N1, no flow control. After powering up the card, the serial output is redirected to our console, and after the boot process is finished, the busybox prompt is shown.
The busybox environment is very limited. Only a few basic commands are available, such as ls, cat, cp, ping, chmod, netstat and vi.
I was able to get Debian GNU/Linux running on the card using the following steps. I used the wget program to copy a basic Debian GNU/Linux root file system to the built-in flash memory on the card. Then I created a chroot environment:
mkdir /mnt/chroot
mount /home/root/usb /mnt/chroot
mount -o bind /proc /mnt/chroot/proc
mount -o bind /dev /mnt/chroot/dev
chroot /mnt/chroot


Figure 4.2: The needed DIP switch. Source: [5]

The chroot environment changes the root directory of the diagnostic kernel to the root directory of the copied Debian GNU/Linux. After this change, we have a fully working operating system on the card and we can install tools for benchmarking. The /dev and /proc directories are needed to provide the network interfaces.
The problem with this solution is that there is no way to change the kernel configuration. For instance, if we want to use pktgen, which is a kernel module for generating packets, we need to recompile the kernel with the required options first. But this is not feasible with this configuration.
4.2.1.2 cnusers SDK

The next possibility was to use the freely available cnusers SDK. This SDK can be downloaded at http://www.cnusers.org/ after a successful registration and its approval. It contains, among other things, the source code of U-BOOT, the Linux kernel 2.6.32 and a few examples. Unfortunately, it does not contain any patches for the WANic 56512 card. Therefore, the built-in flash memory is not accessible and the octeon-ethernet driver is not optimized.
Nevertheless, I was able to cross-compile the Linux kernel with the pktgen module and get a fully working kernel. This SDK also contains the octeon-ethernet driver needed for maintaining the network interfaces. But as I mentioned above, the driver is not optimized for the WANic 56512 card; it is merely a generic driver.
Cross-compilation is a method that allows us to compile source code for an architecture other than the one the compiler runs on; in my case, from x86-64 to MIPS. The cnusers SDK contains all the necessary tools for it (toolchains).


The advantage of cross-compilation is that we can use the computing power and storage capacity of the host system rather than the limited resources of the embedded device.
The process of copying a new kernel to the card is not very convenient. I had to use a TFTP server and copy the new kernel over the TFTP protocol. I ran the following commands in the U-BOOT environment:
setenv ipaddr 10.101.1.101
setenv serverip 10.101.1.100
setenv ethact octeth0
ping 10.101.1.100
tftpboot 0x20000000 /tftpboot/vmlinux.64
bootoctlinux 0x20000000 coremask=0xfff
First, I set the IP addresses of the card and of the TFTP server, and the active network interface. The ping command is necessary to bring the interfaces up. The tftpboot command copies the kernel image from the TFTP server to a specific address (the address 0x20000000 is recommended by the manufacturer). Finally, the bootoctlinux command boots the copied kernel image. With the coremask parameter, I can specify how many cores will run the loaded image (0xfff = 12 cores).
The limitation of this solution is the unavailability of the built-in flash memory; without it, I cannot save any information or install new programs. Only the shared memory is available, and it is cleared after every reboot. There is the possibility of using the NFS protocol, but this solution would require access to one of the ports on the card, and since the card has only two of them, this approach is out of the question.

4.2.2 Installation procedure with buying the SDK

From the previous statements it is obvious that we need to buy the SDK to get better control of the card. We bought the SDK from GE Intelligent Platforms believing that it would contain everything needed for software development. As I discovered later, the SDK contains only the SDK from the cnusers site with patches for our card and some more examples. It does not contain the documentation of the Simple Executive API and its functions. Nevertheless, with this SDK I was able to fully maintain the WANic 56512 card, even via the PCI console.
I will now describe the necessary steps as I performed them. I have to say that the level of documentation provided by the manufacturer was very poor: it does not contain the essential information and it was full of mistakes. Hence, I will provide the installation steps in a corrected form.
First of all, there is a mistake in the installation script. The shell on the host system could not process lines with command >& /dev/null, so I had to change them to command >/dev/null 2>&1. A working installation script is included on the attached CD. After this change, I was able to successfully install the SDK:


cd /the/main/directory/of/the/CD-ROM
sh install.sh /home/octeon/
This command decompresses the files into the desired directory. Then I had to specify the OCTEON model for the GNU toolchains that are needed for the cross-compilation. There is a script that makes the necessary changes in the bash environment (this script needs to be run before every cross-compilation):
cd /home/octeon/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
Then I had to apply patches that add support for the WANic 56512 card. They include
support for the PCI console, built-in flash memory, ethernet driver, etc. The patches can
be applied by running:
cd /home/octeon/OCTEON-SDK/cav-gefes
make patches-install
A very useful thing is to include the programs I need for the benchmarks directly in the root directory of the embedded file system. For example, if I want to use iperf without running the chroot environment, I have to do the following:
mkdir /tmp/extra-files
cp iperf /tmp/extra-files
cd /home/octeon/OCTEON-SDK/linux/embedded_rootfs
make menuconfig
Select programs I want - bridge-utils, ethtool, tcpdump
Specify the Embedded rootfs extra-files directory
make all
Everything in the /tmp/extra-files directory will be included in the embedded file system. A good point to mention here is that the /tmp/extra-files directory has to be created even if we do not want to include any additional programs; without this directory, the compilation would fail. With a prepared root file system, I could proceed to the cross-compilation of the SDK.
cd /home/octeon/OCTEON-SDK/cav-gefes
make menuconfig
In the Build Options, under the Embedded Linux Options choose the
Manufacturing Build
In the Build Options, specify the Embedded rootfs extra-files directory
Select bash for a better shell
make all
For a successful cross-compilation I also had to install the yacc, flex and gettext packages. Next, I edited the /home/octeon/OCTEON-SDK/linux/kernel_2.6/kernel.config file to enable the pktgen module. Other kernel options can be specified in this file. After a successful cross-compilation, the final kernel image is stored in /home/octeon/OCTEON-SDK/linux/kernel_2.6/linux/ as vmlinux.64.

4.2.2.1 PCI console

To get a working PCI console redirection, I had to first recompile the U-BOOT to the new
version.
cd /home/octeon/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
cd /home/octeon/OCTEON-SDK/bootloader/u-boot
make clobber
make octeon_w56xx_config           //for a regular image
make octeon_w56xx_ram_debug_config //for a RAM debug image
make octeon_w56xx_failsafe_config //for a Failsafe image
make
The PCI console redirection works only with the regular U-BOOT image. For a working PCI console I had to compile and load the PCI driver first. This driver enables PCI communication between the WANic 56512 and the host system. A very important note for the cross-compilation of the driver: the host system must contain the header files of the kernel currently running on it. Without them, the cross-compilation ends with a very hard-to-find make error.
cd /home/octeon/OCTEON-SDK/
make -C components/driver/
insmod components/driver/bin/octeon_drv.ko
After running the insmod command, the driver should be loaded. Verification can be done by running dmesg; the output should look like this:
[29520.022310] octeon_drv: module license Cavium Networks taints kernel.
[29520.022314] Disabling lock debugging due to kernel taint
[29520.027178] -- OCTEON: Loading Octeon PCI driver (base module)
[29520.027181] OCTEON: Driver Version: PCI BASE RELEASE 2.0.0 build 73
[29520.027183] OCTEON: System is Little endian (250 ticks/sec)
[29520.027185] OCTEON: PCI Driver compile options: NONE
[29520.027209] OCTEON: Found device 177d:50..Initializing...
[29520.027216] OCTEON: Setting up Octeon device 0
[29520.027230] Octeon 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[29520.027236] Octeon 0000:02:00.0: setting latency timer to 64
[29520.027241] OCTEON[0]: CN56XX PASS2.1
...
[29520.027446] OCTEON[0]: BIST enabled for CN56XX soft reset
[29520.037373] OCTEON[0]: Reset completed
[29520.037381] OCTEON[0] Poll Function (Module Starter arg: 0x0) registered
[29520.039233] OCTEON[0]: Detected 4 PCI-E lanes at 2.5 Gbps
[29520.039241] OCTEON[0]: Enabling PCI-E error reporting..
[29520.039247] OCTEON[0]: CN56XX Pass2 Core Clock: 750 Mhz
...
[29520.039299] Octeon 0000:02:00.0: irq 31 for MSI/MSI-X
[29520.039304] OCTEON[0]: MSI enabled
[29520.039318] OCTEON: Octeon device 0 is ready
[29520.039375] -- OCTEON: Octeon PCI driver (base module) is ready!
[29520.039922] -- OCTEON: Octeon Poll Thread starting execution now!

Now that I have the U-BOOT image and the driver loaded, I can boot the card over the PCI bus. For the PCI boot, DIP switch 3 needs to be changed to the ON position. These commands boot the card and redirect the output to the PCI console:
export OCTEON_REMOTE_PROTOCOL=PCI:<dev>, where <dev> is a number from the dmesg output (usually 0)
cd /home/octeon/OCTEON-SDK/host/remote-utils
oct-remote-boot --board=W5651X --filename=<filename>, where <filename> is the path to the .bin U-BOOT image
oct-remote-bootcmd "setenv pci_console_active yes"
oct-remote-console <dev> --noraw
If the PCI console is not a requirement, I can send input to the card with the oct-remote-bootcmd command.
Now I have a fully working kernel with all the changes I wanted in .config, plus the PCI console redirection. To start working with the card I can use the following procedure:
export OCTEON_REMOTE_PROTOCOL=PCI:0
cd /home/octeon/OCTEON-SDK/host/remote-utils
./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin
./oct-remote-load 0 ../../linux/kernel_2.6/linux/vmlinux.64
./oct-remote-bootcmd "bootoctlinux 0 coremask=0xfff console=pci"
These commands boot the card with the compiled kernel and redirect the output to the PCI console. After a successful boot, the command prompt is ready. It is convenient to create a chroot environment:
mkdir /mnt/chroot
mount /dev/sda /mnt/chroot
mount -o bind /proc /mnt/chroot/proc
mount -o bind /dev /mnt/chroot/dev
modprobe npaDriver
modprobe octeon-ethernet
modprobe pktgen
chroot /mnt/chroot
bash
Now we have a fully working Linux environment with everything we need to perform the benchmark measurements. The rest of this chapter describes WANic's features that can simplify the work with the card.

4.2.2.2 NIC mode

The WANic 56512 can operate in two different modes. The first one is the NIC mode, which allows us to use the front-end ports on the card from the host system. To enable this mode, the following steps need to be performed:
cd /home/octeon/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
cd /home/octeon/OCTEON-SDK/cav-gefes
make menuconfig
In the Cavium PCI Components, under the Build Options choose the
2 x 10G Port NIC mode
Select all the remaining options in the Cavium PCI Component menu
make all
There is a problem in the compilation process with the OCTEON NIC driver because of exported symbol dependencies. The NIC driver depends on symbols exported from the PCI driver, but the information about them is missing in the module. Therefore, it is necessary to specify where the compiler should look for them. The solution is to copy the Module.symvers file from OCTEON-SDK/components/driver/host/driver/linux to OCTEON-SDK/components/driver/host/driver/linux/octnic. This file contains a list of all exported symbols used by both the PCI and the NIC driver. Now I can continue with booting the card and loading the drivers.
export OCTEON_REMOTE_PROTOCOL=PCI:0
cd /home/octeon/OCTEON-SDK/
insmod components/driver/bin/octeon_drv.ko
cd /home/octeon/OCTEON-SDK/host/remote-utils
./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin
Now I need to load an application to initialize and recognize the interface ports. Without this step, the NIC driver would not work properly. There is an example application in the SDK from GE called cvmcs-nic.
./oct-remote-load 0 ../../components/driver/bin/cvmcs-nic.strip
./oct-remote-bootcmd "bootoct 0 coremask=0xfff"
Now comes a tricky part. The application is loaded and working, but I had to wait for the first output of the driver statistics on the console. Only after that could I continue with loading another kernel module:


insmod ../../components/driver/bin/octnic.ko
Similar messages can be found in dmesg output:
[ 1487.306913] OCTEON[0]: Received active indication from core
[ 1487.306923] OCTEON[0] is running NIC application (core clock: 750000000 Hz)
[ 1488.136798] OCTEON[0]: Starting module for app type: NIC
[ 1489.140130] OCTEON[0] Poll Function (Module Starter arg: 0x0) completed (status: Finished)
[ 1577.558616] -- OCTNIC: Starting Network module for Octeon
[ 1577.558627] Version: PCI NIC RELEASE 2.0.0 build 73
[ 1577.558633] OCTNIC: Driver compile options: XAUI_DUAL
[ 1577.558642] OCTEON: Registered handler for app_type: NIC
[ 1577.558647] OCTEON[0]: Starting modules for app_type: NIC
[ 1577.558655] OCTNIC: Initializing network interfaces for Octeon 0
[ 1577.567103] OCTNIC: oct0 -> 10000 Mbps Full Duplex UP
[ 1577.569611] OCTNIC: oct1 Link Down
[ 1577.569621] OCTEON[0] Poll Function (NIC Link Status arg: 0xfa16f000) registered
[ 1577.569626] OCTNIC: Network interfaces ready for Octeon 0
[ 1577.569631] -- OCTNIC: Network module loaded for Octeon
Now I can access the two interfaces of the WANic 56512 via the oct0 and oct1 interfaces on the host system.
4.2.2.3 Ethernet PCI mode

This second mode allows us to use the processing power of the OCTEON chip by sending all the traffic of the host system interfaces to the card as Ethernet frames over PCI. To enable this mode, the following steps need to be performed:
cd /home/octeon/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
cd /home/octeon/OCTEON-SDK/cav-gefes
make menuconfig
In the Cavium PCI Components, under the Build Options choose the
EtherPCI mode
Select all the remaining options in the Cavium PCI Component menu
make all
There is exactly the same problem with the driver as in the previous case, and the solution is the same. After the fix, I could continue with booting the card and loading the drivers.
export OCTEON_REMOTE_PROTOCOL=PCI:0
cd /home/octeon/OCTEON-SDK/
insmod components/driver/bin/octeon_drv.ko
cd /home/octeon/OCTEON-SDK/host/remote-utils
./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx.bin
./oct-remote-load 0 ../../../vmlinux.64
./oct-remote-bootcmd "bootoctlinux 0 coremask=0xfff console=pci"


Now I had to load the modified octeon-ethernet driver on the WANic 56512 card by performing
modprobe octeon-ethernet
Then I needed to load the NIC driver on the host system:
cd /home/octeon/OCTEON-SDK/
insmod components/driver/bin/octnic.ko
After loading the NIC driver, I could see the octX interfaces on the host system and the pciX interfaces on the WANic 56512 card (oct0 refers to pci0, etc.). These interfaces can be configured using ifconfig.

Chapter 5

Benchmarks for WANic 56512


Benchmarks for the WANic 56512 card can be divided into two main groups, depending on
the environment in which the benchmarks are run. Each group can be furthermore divided,
depending on what we want to measure RX, TX, switching, etc. The measurements will be
based on principles from RFC 2544 (Benchmarking Methodology for Network Interconnect
Devices) [10]. I will try to perform identical measurements with different programs to find
the best benchmark for each task. I will also try to perform each task on a different number
of cores to verify whether the task is core-dependent or not.

5.1 RFC 2544

This RFC defines a set of tests that can be used to measure the performance and parameters of the tested network and network devices. In my thesis, I will perform the throughput and the frame loss rate tests.
RFC 2544 also defines the frame sizes to be used in the measurements. The defined sizes are 64, 128, 256, 512, 1024, 1280 and 1518 bytes.

5.1.1 Throughput

To perform this test, we need to send a chosen number of frames (x) at a specific rate (frames per second) to the tested device. Then we need to count the frames that are transmitted by the tested device (y). If x ≠ y, the test needs to be re-run with an adapted rate value; for example, if some of the frames offered at 100% of the line rate are dropped, the rate is lowered and the trial repeated until the highest rate with no loss is found.
The resulting throughput is then "the fastest rate at which the count of test frames transmitted by the Device Under Test (DUT) is equal to the number of test frames sent to it by the test equipment." [10]

5.1.2 Frame loss rate

To perform this test, we need to send a chosen number of frames (x) at a specific rate (frames per second) to the tested device. Then we need to count the frames that are transmitted by the tested device (y). The frame loss rate at each point is calculated using the following formula:

((x - y) * 100) / x

This test should start at 100% of the maximum rate of the input medium. Then the whole procedure should be repeated for 90% of the maximum rate, and so on. This process should be repeated until there are two consecutive runs of the test with no frame loss.
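As a simple illustration with made-up numbers: if 1,000,000 frames are offered and the device forwards 950,000 of them, the frame loss rate for that run is ((1,000,000 - 950,000) * 100) / 1,000,000 = 5%.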

5.2 Benchmarks for Linux environment

In this section I will describe the available benchmarks that run on the GNU/Linux operating system. I will focus only on open-source programs and tools. The goal is to find the best benchmark tool for packet generating, transmitting, receiving and switching that will run on the WANic 56512.

5.2.1 iperf

The first program in my list of benchmark tools is iperf. This tool is used to measure and verify the bandwidth of the tested network. It is capable of generating both TCP and UDP packets and it uses a client-server model.
This tool can be used for benchmarking the parameters of transmitting and receiving packets. iperf runs in user space; therefore, the overall performance is limited by the system calls.
The whole program is controlled from the command line through its parameters. iperf can be started by running iperf -c x.x.x.x on the client side, where x.x.x.x is the IP address of the server, and accordingly iperf -s on the server side. Several parameters can affect the overall performance:
-w Set the TCP window size.
-u Use UDP rather than TCP.
-b When using UDP, set the bandwidth to send at, in bits/sec.
-l Set the length of the buffer to read or write; this effectively sets the size of the packet.
-d Run a bi-directional test simultaneously.
-t Set the duration of the test in seconds.
-P Set the number of parallel connections when using the TCP protocol.
-V Use IPv6 instead of IPv4.
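For illustration only, a test combining these parameters could be started as follows (the server address 10.101.1.2 and the chosen values are placeholders, not the settings used in the actual measurements):
iperf -s
iperf -c 10.101.1.2 -w 256K -P 4 -t 60
iperf -c 10.101.1.2 -u -b 1000M -l 64 -t 60
The first command starts the server, the second runs a 60-second TCP test with four parallel streams and a 256 KB window, and the third runs a 60-second UDP test with 64-byte datagrams at a requested rate of 1000 Mbit/s.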

5.2.2 netperf

netperf is another tool for benchmarking the transmitting and receiving of packets. netperf is similar to iperf: both programs are based on the client-server architecture, both run in user space, and both can generate TCP as well as UDP packets.
The advantage of using netperf is the TCP_SENDFILE and UDP_SENDFILE tests. These tests should result in lower CPU utilization and higher throughput, because the data can be sent directly from the file system buffer cache.
The known disadvantage of netperf is the lack of shaping algorithms. This means that there is no control of the outgoing traffic, which results in flooding the receiver (sending is easier than receiving). This fact can have a severe influence on the performance results when using UDP packets.
The usage of netperf is similar to iperf's. The client can be run by typing netperf -H x.x.x.x into the command line, where x.x.x.x is the IP address of the server. The server can be started either from the command line by typing netserver or as an inetd service. There are also some important parameters:
-t Specify the test to perform: TCP_STREAM, TCP_SENDFILE, etc.
-T Bind netperf to a specific CPU.
-m Set the size of the buffer passed in to the send calls of a _STREAM test.
-M Set the size of the buffer passed in to the receive calls of a _STREAM test.
-s Set the size of the netperf send and receive socket buffers for the data connection.
-S Set the size of the netserver send and receive socket buffers for the data connection.
-D Display the results immediately during the performed test.
With the -T parameter I can run netperf in multiple instances on separate CPUs to avoid a CPU bottleneck. For instance, if I have 4 CPUs, I can do the following:
netperf -H x.x.x.x -T0,0 -l 120 &
netperf -H x.x.x.x -T1,1 -l 120 &
netperf -H x.x.x.x -T2,2 -l 120 &
netperf -H x.x.x.x -T3,3 -l 120 &

The -l 120 argument makes each test run longer, to ensure that the tests run simultaneously.
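As a further illustration (the address and buffer sizes are placeholders), a TCP_SENDFILE test with explicit buffer sizes could be run as:
netperf -H 10.101.1.2 -t TCP_SENDFILE -l 60 -- -m 32768 -s 262144 -S 262144
Note that in netperf the test-specific options (-m, -M, -s, -S) are given after the -- separator, whereas -H, -t, -T and -l are global options.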

5.2.3 curl-loader

curl-loader is a slightly different benchmark tool. The main difference lies in creating thousands of virtual clients that can connect to a website. curl-loader is often compared to commercial products, such as Spirent Avalanche and IXIA IxLoad.


The clients can connect and log in to a specific website using one of the following protocols: HTTP, HTTPS, FTP and FTPS. Their IP addresses can be shared, unique or assigned from an IP pool. The number of clients is hardware-dependent; it can range between 2,500 and 100,000, or even more.
The goal of this tool is to generate as many clients as possible and let them connect to a single web server. This method is called a stress test and resembles the basis of a Denial of Service (DoS) attack. I would like to find out whether the WANic 56512 can be used for generating reasonable traffic for a DoS attack and whether the card can receive and process such traffic.
curl-loader is controlled from the command line and needs a configuration file to run. To run curl-loader, type curl-loader -t x -f <file_name> into the command line, where x is the number of running threads and <file_name> is the configuration file.
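To give an idea of what such a configuration file contains, a minimal sketch loosely following the conf-examples bundled with curl-loader might look roughly like this; the interface, addresses and URL are placeholders, and the exact option names should be verified against the bundled example files:
BATCH_NAME=wanic-test
CLIENTS_NUM_MAX=1000
CLIENTS_NUM_START=100
CLIENTS_RAMPUP_INC=50
INTERFACE=eth1
NETMASK=24
IP_ADDR_MIN=10.101.1.10
IP_ADDR_MAX=10.101.1.200
CYCLES_NUM=-1
URLS_NUM=1
URL=http://10.101.1.2/index.html
URL_SHORT_NAME="index"
REQUEST_TYPE=GET
TIMER_URL_COMPLETION=0
TIMER_AFTER_URL_SLEEP=0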

5.2.4 pktgen

pktgen is a kernel module for generating packets at wire speed. To use this module, we first have to include the pktgen module in the current kernel. To do so, we need to enable CONFIG_NET_PKTGEN in the .config file, recompile the kernel, and insmod or modprobe the pktgen module. pktgen then creates a thread for each CPU in the system.
The advantage of pktgen is that it runs in kernel space. This should assure the best possible packet-generating performance, because there are fewer system calls and interruptions than when running in user space.
The main disadvantage is that pktgen can generate only UDP packets. Another disadvantage is the impossibility of assigning more than one interface device to one CPU. This option is really crucial, since today's computers usually have several CPUs. Fortunately, there is a patch for pktgen that solves this problem [14]. This patch adds a multiqueue architecture, which makes it possible to assign more than one interface device to one CPU.
The usage of pktgen is not very user-friendly. The process of generating packets is driven by a configuration script. The best way to create the configuration script is to have a look at the examples [13] and modify them to our needs.
pktgen can simply be run by executing our configuration script. The result of the benchmark can be viewed by typing cat /proc/net/pktgen/name_of_the_device into the command line.
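To give an idea of what such a configuration script does (the interface name, addresses and counts below are placeholders, not the configuration used in my measurements), a minimal sketch writes the settings into the pktgen /proc files and then starts the run:
#!/bin/sh
# bind the interface to the first pktgen kernel thread
echo "rem_device_all" > /proc/net/pktgen/kpktgend_0
echo "add_device eth1" > /proc/net/pktgen/kpktgend_0
# configure the stream: 10 million frames, 64B size, no inter-packet delay
echo "count 10000000" > /proc/net/pktgen/eth1
echo "pkt_size 64" > /proc/net/pktgen/eth1
echo "delay 0" > /proc/net/pktgen/eth1
echo "dst 10.101.1.2" > /proc/net/pktgen/eth1
echo "dst_mac 00:11:22:33:44:55" > /proc/net/pktgen/eth1
# start the transmission (blocks until all packets are sent)
echo "start" > /proc/net/pktgen/pgctrl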

5.2.5 bridge-utils

bridge-utils is an administrative package of utilities that allows the user to set up a Linux bridge. This bridge can operate not only as a bridge or switch; it can also perform filtering and shaping of incoming and outgoing traffic.
To set up a Linux bridge, enable 802.1d Ethernet Bridging under the Networking menu during the kernel compilation. After a successful compilation, load the module with the modprobe or insmod command. If the loading process was successful, the brctl command should now be functional. Some important parameters of brctl are:


addbr <name> Add a bridge with the specified name.
addif <name> <device> Add a device to the specified bridge.
stp <name> <on/off> Turn the Spanning Tree Protocol on or off for the specified bridge.

After creating the bridge, the selected devices enter promiscuous mode. In this mode the selected devices receive all traffic on the network, which can have a severe impact by overloading the CPUs.
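For illustration (the interface names are placeholders), a simple two-port bridge can be created as follows:
brctl addbr br0
brctl addif br0 eth1
brctl addif br0 eth2
brctl stp br0 off
ifconfig eth1 up
ifconfig eth2 up
ifconfig br0 up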

5.3 Benchmarks for the Simple Executive environment

In this section I will describe the available benchmarks and tools that run in the Simple Executive environment. The choice was very limited; in fact, I was narrowed down to using only benchmarks from GE Intelligent Platforms and Cavium Networks.
Benchmarks for the Simple Executive environment should provide the best possible packet-processing performance, since they are developed specifically for devices with the OCTEON chip. This means that the applications should use all the available hardware accelerators and should be optimized for the architecture used in these devices.
Another improvement comes from the fact that there are no interrupts and no communication between the application and an operating system: applications for the Simple Executive can run without an operating system such as GNU/Linux. This brings a further performance improvement, as there is no need to reserve at least one core for the operating system, so all the available cores can be assigned to the single running application.

5.3.1 traffic-gen

traffic-gen is a benchmark tool for generating packets on a device with the OCTEON
chip. traffic-gen can generate packets of all sizes and both protocols TCP and UDP.
According the README file, traffic-gen is capable of sustaining 10gbps line rate for all
packet sizes. [7]
Getting familiar with traffic-gen was not easy at the beginning. The version included in the SDK from GE (2.0) had problems with the XAUI/DXAUI interfaces; fortunately, this bug is fixed in the latest version of the cnusers SDK (2.3 from 8 March 2012). With the old version the application was very unstable: for example, I was able to run it only once in five consecutive tries, and its behaviour was completely random. At first I thought there was a problem with the hardware flow control, which is needed to pace the output, so I tried different terminal emulators and serial cables, but that did not solve the issue. Luckily, with the new version everything works fine.
To run traffic-gen, the following steps need to be performed (the serial console output is needed):


export OCTEON_REMOTE_PROTOCOL=PCI:0
cd /home/octeon/sdk/OCTEON-SDK/host/remote-utils/
./oct-remote-boot --board=W5651X ../../bootloader/u-boot-octeon_w56xx_ram_debug.bin
./oct-remote-load 0 ../../examples/traffic-gen/traffic-gen
./oct-remote-bootcmd "bootoct 0 coremask=0xfff"
After loading and booting the ELF image, traffic-gen shows a huge amount of statistics on the serial console output, which can be very confusing. The following commands limit the display to the important statistics:
row 1 43 on
row 58 60 off
row 79 81 on
traffic-gen has an enormous number of parameters (83!), but only a few of them are important for our generating purposes (the full list is available in the README file [7]). The most important ones are listed below, followed by a short usage example:
tx.size [[<port range>] <size>] Set size of packet, excluding the frame CRC.
tx.percent [[<port range>] <%>] Set the transmit rate as a % of gigabit.
tx.payload [<port range> <type>] Set the data type for the payload.
tx.type [<port range> <type>] Set the type for the packet.
start [<port range>] | all Start transmitting on these ports.
stop [<port range>] | all Stop transmitting on these ports.
hide [<port range>] | all Hide the statistics for these ports.
bridge [[<port range>] <port>|off] Bridge incoming packets for a port.
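To give an idea of the usage, generating 64 B IPv4/UDP packets at the full 10 Gbps rate on port 0 could be requested with a sequence along these lines (the IPv4+UDP type is an assumption made by analogy with the IPv4+TCP value used in Appendix A.4; the trailing # comments are explanatory only):

tx.size 0 64        # 64 B frames, excluding the CRC
tx.type 0 IPv4+UDP  # IPv4 packets carrying UDP
tx.percent 0 1000   # 1000 % of gigabit, i.e. the 10 Gbps line rate
start 0             # begin transmitting on port 0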

5.3.2 CrossThru

CrossThru is being developed by GE Intelligent Platforms, Inc. This benchmark application can be used to perform packet bridging and switching. Depending on the task to be performed, CrossThru can be compiled and run in two different processing modes:

Fast Forwarding This mode is the simplest one, performing as few operations on packets as possible. It basically receives packets on one XAUI port and transmits them out through the second XAUI port. This mode should provide the best possible performance.


Figure 5.1: CrossThru Flowchart. Source: [2]

Layer 2 Ethernet Switching With this mode, the basic Layer 2 switching operations are possible, such as creating and managing a lookup table. When the destination MAC address is found in the lookup table, the packet is transmitted through the learned port. When it is not found, the CrossThru application broadcasts the packet, since the destination MAC address is unknown.
Each of these modes can be further optimized. Without optimizations, synchronous Packet Order/Work (POW) requests and blocking work-receive operations are used. This means that packet input and output operations cannot be performed simultaneously. Also, there is only one queue for packet output, which is shared by all the cores according to the FIFO principle. The available optimization methods are the following:
Asynchronous POW This method enables receiving incoming packets in the background while the previous packet is still being processed. This reduces the wait cycles required to get the next packet.


Non-blocking This method permits the incoming packets to be stored and processed later.
Lock-less Packet Output (PKO) This method assigns each core its own output queue, so the cores do not have to wait for an available output queue, which speeds up the packet flow. The number of queues can be specified during the compilation process and ranges from 1 to 6 per interface. With this method the software has to tag the packets itself, because packet ordering issues can arise.
The installation process is similar to the installation of the SDK.
cd /home/octeon/sdk/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
cd cav-gefes
make menuconfig
Under the SE/Linux Example Applications menu select CrossThru
Under the CrossThru Options menu select the desired optimization
Under the Build Options menu select Basic Embedded Linux Build
make all
There is a serious mistake in the documentation from GE: CrossThru has to be run in the normal U-BOOT mode; with any other mode, CrossThru fails to load.
Another problem with running CrossThru is that it does not provide any statistics counters by default. When the application runs, there is no output on the serial or PCI console. Slightly more information can be obtained by specifying the debug mask:
./oct-remote-bootcmd "bootoct 0 coremask=0xfff --debug <debug mask>"
where <debug mask> is between 0x0 and 0xFFF (0xFFF means output everything; the full list of codes is included in [2]). However, we have to be careful with the debug mask: too much debug information has an adverse effect on the processing performance. And even with the debug mask, there is no useful information or counters for our purposes.
GE includes another debug utility within its SDK, called dbgse. According to the documentation [1], this utility can display the CPU usage, line usage, port packet statistics counters, and the L2 table. These statistics would perfectly fit our needs, but unfortunately I was not able to run the dbgse utility. First, the application needs to be compiled. The process is the same as for CrossThru, but a different option has to be selected:
Under the SE/Linux Example Applications menu select SE Debug Linux Utility
The problem is that during the load phase of dbgse the whole terminal freezes:
./oct-remote-load 0 ../../cav-gefes/examples/simple_exec/crossthru/crossthru
./oct-remote-bootcmd "bootoct 0 coremask=0xffe"
./oct-remote-load 0 ../../linux/kernel_2.6/linux/vmlinux.64
./oct-remote-bootcmd "bootoctlinux 0 coremask=0x1 mem=1024@3072M console=pci"
dbgse


At this point the application returns an error saying that the npa device could not be found and the application could not be started. I had to add the npa device manually:
cat /proc/devices and search for npa, usually it is # 251
mknod /dev/npa c 251 0
mkdir /var/log
modprobe npaDriver
At this point, the terminal freezes. I debugged the problem, successfully localized the exact line and instruction where the program freezes, and reported it as a bug to GE. In reply I got a message saying that they do not know why the program freezes at this point, but that they will look into it. After this reply, I sent several e-mails asking about my situation, but I did not receive any new updates from GE.
This rules out CrossThru as a fully operational benchmark tool. Nevertheless, I will still try to perform as many tests as possible to get a general idea about the performance of CrossThru.


Chapter 6

Benchmarking
In this chapter I describe how I ran the proposed benchmarks and show the configuration I used. There was also a request for as much automation as possible, so I focused on minimizing the human interaction needed to run the benchmarks. As a result, various script files are presented and included in the appendix. Nevertheless, automation was not always feasible.

6.1 iperf

As I mentioned before, iperf is a benchmark that requires the Linux operating system, so the Linux kernel needs to be booted first. This procedure is described in subsection 4.2.2.1. For the best performance, all available cores should be booted up; thus, the core mask 0xfff is required.
iperf is not a standard part of the Linux root filesystem, so I had to add it first. There were two ways to do it. The first option was to use aptitude to download the package from the Internet and install it into the chroot environment, but that would require setting up an outside network connection in our laboratory, a proxy, and so on. Instead, I used an approach that adds iperf from a downloaded package directly into the root filesystem. This approach is described in subsection 4.2.2.
For the server I used the command iperf -s -w 64KB, and for the client iperf -c 10.101.1.100 -w 64KB -l 64 -P 12 -t 120.
The window size was chosen based on the results of various tests I performed. The length (-l) is a value from the set defined by RFC 2544 [10]. The -P parameter reflects the fact that there are 12 cores on the chip: each core is assigned one connection and the connections run in parallel. Without this parameter I got really poor results, because all the computing was done on one core, which cannot handle such a load. The last parameter sets the duration of the test; the default value of 10 seconds seemed too short for trustworthy results.
With iperf I successfully measured the transmitting and receiving capabilities of the card in user mode. The tests I performed used the TCP and UDP protocols with IPv4 and IPv6. The packet sizes were in accordance with RFC 2544.
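The UDP and IPv6 runs used the same pattern of options. A sketch of the UDP commands, with an assumed per-stream bandwidth target, could be:

iperf -s -u -w 64KB                                            # server in UDP mode
iperf -c 10.101.1.100 -u -b 1000M -w 64KB -l 64 -P 12 -t 120   # client, about 1 Gbit/s per stream

For the IPv6 tests, iperf additionally takes the -V switch on both the server and the client side.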


For transmitting, I used WANic 56512 as the generator and another PC as the receiver; for receiving, the roles were reversed.
The level of automation for iperf is pretty high. The user just needs to run the server and then a script. Separate results for each packet size are then saved to a file; TCP and UDP are measured separately. The source code of the scripts is included in the appendix.
$ ./iperf_script_tcp - for the TCP
$ ./iperf_script_udp - for the UDP

6.2 netperf

This program is very similar to the previous one, so the steps necessary to perform my measurements are similar as well. First, I added netperf the same way as I added iperf. Then I booted the kernel with all available cores.
I started the server with the command netserver. On the client side I used
netperf -H 10.101.1.100 -t TCP_STREAM -- -m 64 -s 64 -l 120 -D. Binding to a specific CPU did not have any effect on the performance, so I omitted that parameter. I also tried other tests, e.g. TCP_SENDFILE and UDP_STREAM.
I was able to measure the potential of the card for transmitting and receiving. I used both protocols, TCP and UDP, with IPv4 and IPv6, and all the packet sizes defined in RFC 2544.
In terms of automation, I created a script that runs netperf for every packet size and all three tests. The output is then stored in a new file.
$ ./netperf_script

6.3 curl-loader

Unfortunately, I was not able to cross-compile curl-loader for the MIPS architecture. There were problems with missing libraries, and when I cross-compiled them, they did not work on the MIPS architecture. On the other hand, this is not a big problem, since I was able to measure packet transmission using other benchmarks, and with curl-loader there is nothing except transmission to measure.

6.4 pktgen

I already described what is needed to run pktgen in subsection 5.2.4. As a configuration script I used an example from [13] and modified it to match my setup. For the multi-core environment I had to add the following for each core:


PGDEV=/proc/net/pktgen/kpktgend_0
pgset "add_device xaui0@0"
....
PGDEV=/proc/net/pktgen/xaui0@0
pgset "pkt_size 60"
pgset "flag QUEUE_MAP_CPU"
Then I booted the Linux kernel and added the appropriate modules via modprobe. I was able to measure packet generation for each packet size with respect to RFC 2544. For more accurate results I used the internal counters in the switch, so there is not much to automate: the user just needs to run a script with the correct packet size.
$ ./pktgen_<size> - where <size> is the size of packets (64, 128, etc.)
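Although the switch counters were used for the final numbers, the per-queue TX counters of pktgen can also be collected after a run. One possible way, using the device names from the configuration above, is a small loop over the proc entries:

for i in $(seq 0 11); do
    grep -H -A2 "Result:" /proc/net/pktgen/xaui0@$i
done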

6.5 bridge-utils

bridge-utils is the only benchmark I could use to measure switching on Linux. It was quite easy to get it working. First, I needed to change the kernel configuration as mentioned in subsection 5.2.5. After booting the Linux kernel and creating the chroot environment, I could continue with setting up the bridge.
ifconfig xaui0 0.0.0.0
ifconfig xaui1 0.0.0.0
brctl addbr "MYBRIDGE"
brctl addif MYBRIDGE xaui0
brctl addif MYBRIDGE xaui1
ifconfig MYBRIDGE up
These commands created the bridge from the two interfaces on the card. After that, I just needed to generate traffic at a specific speed and with the desired parameters, such as packet size. For this purpose I used the traffic generator created by Moris Bangoura in his thesis [9]. The packets were received on one interface and transmitted on the other. Throughput and frame loss were obtained from the internal counters of our H3C switch. For this measurement I used UDP traffic with packet sizes of 64 to 1518 B.
Unfortunately, there is no way to automate this measurement, since I used the internal counters in the switch to get the results. The only thing that could be automated is the process of generating the packets; this has already been done by Moris, and I used his scripts.

6.6 traffic-gen

traffic-gen is the first program that uses the Simple Executive API. It is a part of the cnusers SDK and is distributed only as source code, so I needed to cross-compile it first. The cross-compilation can be done by:


cd /home/octeon/OCTEON-SDK
source env-setup OCTEON_CN56XX_PASS2
cd examples/traffic-gen
make
This will generate a binary file which can later be booted as shown in subsection 5.3.1. After a successful boot, I needed to enter commands to generate packets. Receiving is done automatically when traffic arrives at one of the two interfaces on the card. Bridging can be turned on with the bridge command. All the necessary commands are included in the scripts. Results are obtained by reading the values from the serial output.
With traffic-gen I was able to successfully measure:
TX, RX, throughput, and frame loss with IPv4, TCP, packet sizes 64-1518 B, all cores.
TX, RX, throughput, and frame loss with IPv4, UDP, packet sizes 64-1518 B, all cores.
TX, RX, throughput, and frame loss with IPv6, TCP, packet sizes 64-1518 B, all cores.
TX, RX, throughput, and frame loss with IPv6, UDP, packet sizes 64-1518 B, all cores.
Since we want to have things automated, I prepared a set of scripts. The user can issue
the script by typing:
$ cat traffic_<size>_[v4|v6]_[tcp|udp] > /dev/ttyX
where <size> is the size of packets (64, 128, etc.) and X is the port of the serial cable.
It is obvious that a serial connection is needed to perform this measurement. Unfortunately, this brings some limitations to the automation: there is no way to retrieve the results from the console automatically, so this has to be done manually.

6.7 CrossThru

CrossThru is a benchmark that can be used for fast forwarding or switching. It also uses the Simple Executive API and runs in the Standalone mode like traffic-gen. CrossThru is a part of the SDK from GE, and I already described how it can be cross-compiled in subsection 5.3.2.
After loading the image into memory and booting, the application does nothing visible: there is no output on the console and the application waits for incoming traffic. When traffic arrives at one of the interfaces, it is forwarded to the other interface. Since the card has only two interfaces, we can speak of fast forwarding.
As I mentioned in subsection 5.3.2, CrossThru can operate in two modes, and these modes can be further optimized. I tried all four combinations to see which one gives the best results. I again used the generator made by Moris to produce traffic to send to WANic 56512; pktgen generated UDP traffic with packet sizes of 64 to 1518 B. The results were obtained from the internal counters of the H3C switch.
Since I used the internal counters, there is not much to automate. Again, the only thing that could be automated is the generation of packets using pktgen.

Chapter 7

Analysis of the benchmarks


7.1 iperf

As expected, the obtained results indicate that it is not really worth running non-optimized benchmarks on WANic 56512. The measured values are low partly because iperf runs in user space.
Transmitting and receiving gave me very similar results; however, they do not meet the expectations for 10 GbE networks. Figure 7.1 shows that iperf is not capable of transmitting or receiving packets at wire speed. Figure 7.2 shows that I got even worse results for the UDP protocol. This was caused by a CPU bottleneck which is clearly visible in the graph. I tried to distribute the load between the CPUs, but it had no effect on the results: the CPUs were still used at 100%. Also, there was no difference between IPv4 and IPv6.
I cannot compare my results with the results of Moris or Radim, because they did not use iperf. A comparison between iperf and pktgen would not be useful either, since these programs run in different modes.

7.2 netperf

With netperf the situation is very similar to the one with iperf; I would not recommend this benchmark for measurements on WANic 56512.
As I mentioned previously, the advantage of netperf should be the ability to send pieces of a file instead of a generated stream. As figure 7.3 shows, I did not get very different results. That is because there were no CPU bottlenecks for the TCP_STREAM measurement, and sending a file instead of a stream mainly eases the CPU load.
Only the UDP_STREAM test was available for the UDP protocol. Figure 7.4 shows that this test gave better results than the TCP protocol. This is because UDP is easier to generate, transmit, and receive than TCP, and again there were no CPU bottlenecks. IPv4 and IPv6 gave similar results.
I have to note that I omitted the receiving part from the graphs, since it was almost identical to transmitting. And again, I cannot compare my results, since Moris and Radim did not use netperf.


7.3 pktgen

From the results (figure 7.5) it is apparent that there is some sort of TX bottleneck. It is very hard to identify the cause in this case, since I could not use ethtool to obtain detailed statistics. The processing load was divided among all cores, and they were idle most of the time (~92%); the bottleneck may be caused by the inability to set interrupt affinity.
This tool could be used as a packet generator, but only for larger sizes: it can generate at wire speed for packets of 1024 B and above.
Comparing my results with those of Radim and Moris, I got worse results than both of them: Radim can generate packets at wire speed from 128 B and Moris from 64 B packets. Again, I have to point out that they could optimize their network cards for pktgen, which I could not.

7.4 bridge-utils

This benchmark gave me very poor results. It can forward packets almost at wire speed for sizes 512 B and 1024 B, and at wire speed for sizes 1280 B and above. Figure 7.6 shows a comparison between the theoretical maximum and my results. As can be seen in figure 7.7, 64 B packets are massively dropped unless the offered load is lowered to 10%.
These unsatisfactory results are caused by the fact that the brctl application is not optimized for WANic 56512, and at the same time I could not perform any optimizations on WANic 56512 as Moris Bangoura did with the Intel cards. The results I obtained are very similar to the results presented by Radim Roka in his thesis. This leads me to the conclusion that such results can be used as a baseline for the brctl benchmark on a 10 GbE NIC when no optimizations are made.

7.5 traffic-gen

With this tool I was able to verify that the card is capable of transmitting, receiving and fast-forwarding all packet sizes at wire speed. I write fast-forwarding intentionally, because this program does not create the MAC table needed for switching; it just forwards packets from one interface on the card to the other.
There were no differences between IPv4 and IPv6, nor between the TCP and UDP protocols. I even tried different types of payload, such as random data, an ascending/descending sequence, and text, and none of this had any effect on the overall performance.
This tool gave me the best results of all the benchmarks, and it also has a wide range of uses: transmitting, receiving and fast-forwarding. I present only one figure for traffic-gen (figure 7.8), because the graphs for TX and RX would look the same, and the frame loss graph would be empty.

7.6 CrossThru

I did not get the results I would expect from a benchmark developed by the manufacturer of WANic 56512. In fact, the results were rather poor for an application developed with the


Simple Executive API. I can only guess why the results are that bad, since I could not run the diagnostic program for CrossThru; the technical support of GE Intelligent Platforms could not solve my problem with the dbgse utility. My guess is that the application is not yet as fully optimized for Cavium processors as traffic-gen is, since it is still version 0.2.
But even these poor results were better than with brctl. I noticed only slight differences between the Fast Forwarding and Layer 2 Ethernet Switching modes, so each graph covers both modes. There were, however, visible improvements between the basic and the optimized version.
Figure 7.9 shows that the optimized version is capable of switching packets of 256 B and above at wire speed; the basic version can do the same for packets starting at 512 B. Figures 7.10 and 7.11 show that 64 B packets are heavily dropped until the offered load is lowered to 30%.

7.7 Graphs

In this section I present all the graphs. The unit of frame rate is Packets per Second (pps).

Figure 7.1: TX and RX test - TCP - iperf


As we can see from the obtained results, WANic 56512 is not a very suitable candidate for use with the Linux operating system. That could be expected, since there are no mechanisms to optimize Linux programs to run on the card.
But the results also show that WANic 56512 is capable of transmitting, receiving and fast-forwarding 64 B packets at wire speed, and I believe that the card has great potential if the right, optimized software is run on it. If we had more money to buy the software toolkits from Cavium Networks [6], we could run more tests on the card; there are toolkits available for TCP/IP, IPsec, protocol analysis, IDS/IPS, etc. It is clear that using the card with software that has no optimization for WANic 56512 or the OCTEON chip really wastes the potential of the card.


Figure 7.2: TX test - UDP - iperf

Figure 7.3: TCP_STREAM and TCP_SENDFILE test - netperf


Figure 7.4: UDP_STREAM test - netperf

Figure 7.5: TX test - pktgen


Figure 7.6: RFC 2544 - throughput test - brctl

Figure 7.7: RFC 2544 - frame loss test - brctl


Figure 7.8: RFC 2544 - throughput test - traffic-gen

Figure 7.9: RFC 2544 - throughput test - CrossThru FF+L2


Figure 7.10: RFC 2544 - frame loss test - CrossThru FF+L2 basic

Figure 7.11: RFC 2544 - frame loss test - CrossThru FF+L2 optimized

Chapter 8

Conclusion
All the requested tasks were successfully accomplished. The installation guide provides in detail all the necessary steps to install WANic 56512 into the PC in our laboratory. This will significantly reduce the time required to deploy the card in future work. The guide also shows how to install additional software for the card, which will be needed in future development. A part of this thesis then covers the basics of packet flow and the hardware acceleration units, which is needed to understand how the accelerators work.
A varied set of benchmarks was proposed in this thesis to test multiple parameters of WANic 56512. Although the results were not completely satisfactory, they helped me come up with several conclusions. The first conclusion is that WANic 56512 is not built to work with benchmarks or programs written for the Linux operating system, because these programs cannot use the hardware acceleration units included on the card. Therefore, and this is the second conclusion, we have to buy or develop programs specifically for WANic 56512. I tried to obtain such software for our university: I asked Cavium Networks whether they could give us the software they used to obtain the results presented in their product brief [4]. Unfortunately, the marketing team said that they do not provide free support, as it is not in accordance with their business model, and that the results presented in the product brief cannot be obtained unless we buy additional software from them [6].
Still, the mentioned benchmarks gave me a basic idea of what WANic 56512 is capable of. The card can at least generate, transmit, receive and fast-forward 64 B packets at wire speed, which is about 14.8 Mpps. But to achieve such results, optimized software needs to be run.
In terms of the comparison with the results of Radim Roka and Moris Bangoura, WANic 56512 lies somewhere in between: the card gave me better results than Roka presents in his thesis, and results similar to those Bangoura got with the use of GPUs. Moris, however, went even further in his measurements; he was able to perform routing and firewalling on his prototype. I could not perform such tests, because we do not have the required software for these tasks. We also have to consider the price of the proposed solutions: WANic 56512 cost 129 000 Kč and the SDK about 50 000 Kč, whereas Moris needed two Intel 10 GbE network cards at 15 000 Kč each and two graphics cards at 15 000 Kč each.


8.1 Future work

I think that many follow-up projects could be based on this card. But first, we have to buy the SDK from Cavium Networks to be able to develop our own programs for WANic 56512 and the OCTEON chip; right now, we do not have the required documentation of all the Simple Executive functions needed for the development of new programs.
One of the projects could be a cooperation with GE Intelligent Platforms to help them develop and optimize the CrossThru application, resulting in an application capable of switching 64 B packets at wire speed. Then, we could develop a program for routing and firewalling, possibly getting even better results than the prototype made by Moris. And last but not least, we could use WANic 56512 for deep packet inspection and as a real-time packet analyzer.

Bibliography
[1] User's Guide: Debug Simple Executive Utility for Linux, 2011.
[2] User's Guide: OCTEON CrossThru Application, 2011.
[3] Cavium Networks - Products > OCTEON Plus MIPS64 Processors > Silicon [online]. 2012. [cit. 8. 12. 2012]. <http://www.cavium.com/OCTEON-Plus_CN56XX.html>.
[4] OCTEON® Plus CN56XX 8 to 12-Core MIPS64-Based SoCs [online]. 2011. [cit. 9. 12. 2012]. <http://www.cavium.com/pdfFiles/CN56XX_PB_Rev1.3.pdf>.
[5] Reference Manual WANic*-56512 Packet Processor, 2011.
[6] CSS Software Toolkits [online]. 2011. [cit. 9. 12. 2012]. <http://www.cavium.com/css_software_toolkits.html>.
[7] Octeon Simple Executive based Traffic Generator, 2012.
[8] WANic 56512 Packet Processor [online]. 2010. [cit. 8. 12. 2012]. <http://www.ge-ip.com/account/download/11628/3347>.
[9] BANGOURA, M. 10GbE Routing on PC with GNU/Linux [online]. 2012. <https://dip.felk.cvut.cz/browse/pdfcache/bangomor_2012dipl.pdf>.
[10] BRADNER, S., MCQUAID, J. Benchmarking Methodology for Network Interconnect Devices. RFC 2544 (Informational), March 1999. <http://www.ietf.org/rfc/rfc2544.txt>. Updated by RFC 6201.
[11] CURTIS, J. OCTEON® Programmer's Guide, 2010.
[12] MENG, J. et al. Towards high-performance IPsec on Cavium OCTEON platform. In Proceedings of the Second International Conference on Trusted Systems, INTRUST'10, pp. 37-46, Berlin, Heidelberg, 2011. Springer-Verlag. doi: 10.1007/978-3-642-25283-9_3. <http://dx.doi.org/10.1007/978-3-642-25283-9_3>. ISBN 978-3-642-25282-2.
[13] OLSSON, R. pktgen examples [online]. 2008. [cit. 9. 12. 2012]. <ftp://robur.slu.se/pub/Linux/net-development/pktgen-testing/examples/>.
[14] OLSSON, R. [PATCH] pktgen: multiqueue etc. [online]. 2008. [cit. 9. 12. 2012]. <http://git.et.redhat.com/?p=kernel-kraxel.git;a=patch;h=e6fce5b916cd7f7f79b2b3e53ba74bbfc1d7cf8b>.


[15] ROHRBACHER, M. Measurement of throughput of Ethernet Cards on Telum NPA-5854 device [online]. 2010. <https://dip.felk.cvut.cz/browse/pdfcache/rohrbmi1_2010bach.pdf>.
[16] ROKA, R. Performance evaluation of GNU/Linux network bridge [online]. 2011. <https://dip.felk.cvut.cz/browse/pdfcache/roskarad_2011dipl.pdf>.
[17] Wikipedia contributors. MIPS Architecture (Pipelined) [online]. 2009. [cit. 9. 12. 2012]. <http://en.wikipedia.org/wiki/File:MIPS_Architecture_(Pipelined).svg>.

Appendix A

Scripts
In this appendix I will show examples of the scripts I made for automation.

A.1 iperf

#!/bin/bash
SIZE=(64 128 256 512 1024 1280 1518)
HOST=10.101.1.100
for size in ${SIZE[*]} ; do
    # print the packet size and the measured bandwidth (SUM line, 12th field)
    echo $size - $(iperf -c $HOST -w 64KB -l $size -f b -P 12 -t 120 \
        | grep SUM | cut -d" " -f12) | tee -a iperf_tcp
done

A.2 netperf

#!/bin/bash
SIZE=(64 128 256 512 1024 1280 1518)
HOST=10.101.1.100
echo TCP_STREAM > netperf_tcp
for size in ${SIZE[*]} ; do
    # print the packet size and the matching netperf result line
    echo $size - $(netperf -H $HOST -t TCP_STREAM -- -m $size \
        -s $size -D | grep $size) | tee -a netperf_tcp
done
echo TCP_SENDFILE >> netperf_tcp
for size in ${SIZE[*]} ; do
    echo $size - $(netperf -H $HOST -t TCP_SENDFILE -F big_file.iso -- \
        -m $size -s $size -D | grep $size) | tee -a netperf_tcp


done
echo UDP_STREAM > netperf_udp
for size in ${SIZE[*]} ; do
    echo $size - $(netperf -H $HOST -t UDP_STREAM -- -m $size \
        -s $size | grep $size) | tee -a netperf_udp
done

A.3 pktgen

#! /bin/bash
#rmmod pktgen
modprobe pktgen

# PACKET SIZE - NIC adds 4 bytes CRC
PKT_SIZE="pkt_size 60"
COUNT="count 0" # pkts to send, 0 is infinity
DELAY="delay 0" # delay 0 means maximum speed
# thread config
SIRQ=1000
CLONE_SKB="clone_skb 8"
function pgset() {
    local result
    echo $1 > $PGDEV
    # check whether the pktgen proc file reports success
    result=$(cat $PGDEV | fgrep "Result: OK:")
    if [ "$result" = "" ]; then
        cat $PGDEV | fgrep Result:
    fi
}
function pg() {
echo inject > $PGDEV
cat $PGDEV
}

# Config Start Here ---------------------------------------------------

PGDEV=/proc/net/pktgen/kpktgend_0
pgset "rem_device_all"
pgset "add_device xaui0@0"
pgset "max_before_softirq $SIRQ"
PGDEV=/proc/net/pktgen/kpktgend_1
pgset "rem_device_all"
pgset "add_device xaui0@1"
pgset "max_before_softirq $SIRQ"
...
PGDEV=/proc/net/pktgen/kpktgend_11
pgset "rem_device_all"
pgset "add_device xaui0@11"
pgset "max_before_softirq $SIRQ"
PGDEV=/proc/net/pktgen/xaui0@0
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$DELAY"
pgset "dst_min 10.0.2.100"
pgset "dst_max 10.0.2.149"
pgset "udp_dst_min 1"
pgset "udp_dst_max 1024"
# pgset "flag IPDST_RND" # between dst_min and dst_max
# pgset "dst_mac 00:25:90:00:00:00" # fake mac pktgen test
pgset "dst_mac 00:12:c0:4c:6e:09" # real mac
pgset "flag QUEUE_MAP_CPU"
pgset "queue_map_min 0"
# pgset "queue_map_max 3"
PGDEV=/proc/net/pktgen/xaui0@1
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$DELAY"
pgset "dst_min 10.0.2.150"
pgset "dst_max 10.0.2.200"


pgset "udp_dst_min 1"


pgset "udp_dst_max 1024"
# pgset "flag IPDST_RND" # between dst_min and dst_max
# pgset "dst_mac 00:25:90:00:00:00" # fake mac pktgen test
pgset "dst_mac 00:12:c0:4c:6e:09" # real mac
pgset "flag QUEUE_MAP_CPU"
pgset "queue_map_min 0"
# pgset "queue_map_max 7"
...
PGDEV=/proc/net/pktgen/xaui0@11
echo "Configuring $PGDEV"
pgset "$COUNT"
pgset "$CLONE_SKB"
pgset "$PKT_SIZE"
pgset "$DELAY"
pgset "dst_min 10.0.1.150"
pgset "dst_max 10.0.1.200"
pgset "udp_dst_min 1"
pgset "udp_dst_max 1024"
# pgset "flag IPDST_RND" # between dst_min and dst_max
# pgset "dst_mac 00:25:90:00:00:03" # fake mac pktgen test
pgset "dst_mac 00:12:c0:4c:6e:09" # real mac
pgset "flag QUEUE_MAP_CPU"
pgset "queue_map_min 0"
# pgset "queue_map_max 7"
# Time to run
PGDEV=/proc/net/pktgen/pgctrl
echo "Running... ctrl^C to stop"
pgset "start"
echo "Done"
# Results can be viewed in /proc/net/pktgen/xauiX

A.4 traffic-gen

row 1 43 on
row 48 60 off
row 79 81 on
tx.percent 0 1000
tx.percent 16 1000
tx.size 0 64

tx.size 16 64
tx.type 0 IPv4+TCP
tx.type 16 IPv4+TCP
start 0
start 16


Appendix B

CD content
The CD contains the following directories:
data_for_graphs
Includes the data I obtained during my measurements.
install
Includes the corrected install script to install the SDK.
scripts
All the required automation scripts.
text
A .pdf file with the thesis.
text_src
Contains the source code of the thesis.
