
Technical Report CS10-02

A Genetic Algorithm to Improve


Kernel Performance on
Resource-Constrained Devices

Jim Kukunas

Submitted to the Faculty of


The Department of Computer Science

Project Director: Dr. G. M. Kapfhammer


Second Reader: Dr. R. D. Cupper

Allegheny College
2010

I hereby recognize and pledge to fulfill my


responsibilities as defined in the Honor Code, and
to maintain the integrity of both myself and the
college community as a whole.

Jim Kukunas
Copyright © 2010
Jim Kukunas
All rights reserved

JIM KUKUNAS. A Genetic Algorithm to Improve Kernel Performance
on Resource-Constrained Devices.
(Under the direction of Dr. G. M. Kapfhammer.)

Abstract

As computers become increasingly mobile, users demand more functionality, longer


battery-life, and higher performance from mobile devices. To satiate these demands,
chipset fabricators are focusing on unique and elegant architectures to provide low-
power, high-performance solutions. As these architectures rely on unique x86 exten-
sions rather than fast clock speeds and large caches, careful thought must be placed
into effective optimization strategies for not only user applications, but also the kernel
itself, as the typical “blanket” optimizations used by today’s compilers do not often
take advantage of these specialized architectures. Focusing on the Intel Diamondville
platform, this paper presents a genetic algorithm that evolves the compiler flags used to
build the Linux kernel in order to improve kernel performance. The kernels evolved by the genetic
algorithm outperformed the stock Fedora kernel in 65% of the test cases.

Acknowledgments

I would like to thank my family, friends, and advisors, Dr. Gregory Kapfhammer and
Dr. Robert Cupper, for all their help and support throughout this long process.

Contents

Acknowledgments iv

List of Tables vii

List of Figures viii

1 Overview 1
1.1 Diamondville . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.2 Linux Kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Intel C Compiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
1.4 Genetic Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.5 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.6 Thesis Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2 Prior Work 13
2.1 ACOVEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 MILEPOST GCC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.3 Cooper et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4 COLE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Davidson et al. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Design and Implementation 17


3.1 The Profiler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 The Genetic Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2.1 Population Representation . . . . . . . . . . . . . . . . . . . . 19
3.2.2 Compiler Option Initialization . . . . . . . . . . . . . . . . . . 21
3.2.3 Initial Population . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.4 Fitness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.5 Reproduction, Mutation, and Selection . . . . . . . . . . . . . 26
3.3 Build Farm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.4 Testing Farm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4 Results 29
4.1 Configurations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Resulting Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.3 Produced Kernels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
4.4 Application Benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4.1 Tests with Improved Performance . . . . . . . . . . . . . . . . 34
4.4.2 Tests with Decreased Performance . . . . . . . . . . . . . . . 38
4.4.3 Tests with Static Performance . . . . . . . . . . . . . . . . . . 39
4.5 Results Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.6 Threats To Validity . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

5 Conclusion and Future Work 42


5.1 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.2 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
5.3.1 Continuations . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.3.2 Dynamic Option Parsing . . . . . . . . . . . . . . . . . . . . . 43
5.3.3 Random Number Evaluation . . . . . . . . . . . . . . . . . . . 43
5.3.4 Persistent Fitness Results . . . . . . . . . . . . . . . . . . . . 43
5.3.5 Profiles, Generations, Population Size . . . . . . . . . . . . . . 43

A Code Listings 44
A.1 Profiler Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A.2 GA Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.2.1 main.h . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.2.2 main.c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.2.3 Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3 Builder Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3.1 main.c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
A.3.2 Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.4 Client Code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.4.1 main.c . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
A.4.2 check.sh . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A.4.3 Makefile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
A.5 Compiler Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

B Phoronix Results 101


B.1 Fedora Stock . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
B.2 ICC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
B.3 Run0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
B.4 Run1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132
B.5 Run2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

Bibliography 153

List of Tables

1.1 Compiler Flags Set at Optimization Levels for the Linux Intel C Compiler 6

4.1 Genetic Algorithm Configurations. . . . . . . . . . . . . . . . . . . . 30


4.2 Application Profiler Results. . . . . . . . . . . . . . . . . . . . . . . . 30
4.3 Highest Fitnesses For Each Configuration. . . . . . . . . . . . . . . . 30
4.4 Phoronix Netbook Test Suite. . . . . . . . . . . . . . . . . . . . . . . 33
4.5 LAME Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.6 OGG Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4.7 Scimark0 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.8 Scimark1 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 35
4.9 Scimark2 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.10 Sqllite Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.11 GNUPG Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.12 Cray Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.13 GTKPerf0 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.14 GTKPerf1 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.15 GTKPerf2 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.16 7zip Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
4.17 IOZone0 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.18 IOZone1 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.19 IOZone2 Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.20 Ram Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.21 FFMpeg Test Results. . . . . . . . . . . . . . . . . . . . . . . . . . . 40

List of Figures

1.1 Kernel and User Space Interaction. . . . . . . . . . . . . . . . . . . . 9


1.2 Flowchart For a Generic Genetic Algorithm . . . . . . . . . . . . . . 11

3.1 Results From Running the Profiler. . . . . . . . . . . . . . . . . . . . 17


3.2 Example Population. . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

4.1 Phoronix Benchmark Results (Smaller Is Better). . . . . . . . . . . . 41

Chapter 1

Overview

“Performance isn’t a secondary concern. It changes how you work.”


– Linus Torvalds [29]

Optimizing compilers have the ability to perform various optimizations while


translating high-level code to machine language. As with most things, these opti-
mizations present trade-offs, which may be desirable in some situations, and unde-
sirable in other situations. For instance, most compilers are capable of performing
loop-unrolling. When a compiler performs loop-unrolling, the compiler attempts to
rewrite loops to reduce the number of loop iterations and thus reduce the overall num-
ber of instructions required to evaluate the loop constraint [1]. Consider the simple
example in Listing 1.

Listing 1 Loop before Loop Unrolling

#include<stdlib.h>
int main(int argc, char** argv)
{
int* test = malloc(sizeof(int)*100);
for(int i = 0; i < 100; i++) {
*(test+i) = 15;
}
free(test);
}

Source code Listing 1 iterates through every element in the array test and sets it
to 15. Without any optimization, this loop will perform 100 loop iterations, which
means that it will not only need to execute 100 instructions to store the values into
the array, but it will also need to check the loop constraint 101 times and increment
the loop counter variable 100 times. This can be seen in the assembly generated by
the Intel C++ Compiler in source code listing 2, with optimizations disabled.
Source code listing 3 demonstrates loop-unrolling performed on the previous ex-
ample. While the number of instructions to store the values into the array remain the
Listing 2 Compiler Generated Assembler Without Optimization.

movl $0, -8(%ebp)


movl -8(%ebp), %eax
cmpl $100, %eax
jge ..B1.5
..B1.4:
movl -8(%ebp), %eax
movl -12(%ebp), %edx
movl $15, (%edx, %eax, 4)
inc -8(%ebp)
movl -8(%ebp), %eax
cmpl $100, %eax
jl ..B1.4
..B1.5:

same, the number of loop iterations is reduced from 100 to 20, which means that
each addl instruction replaces five inc instructions.

Listing 3 Loop Unrolling

#include<stdlib.h>
int main(int argc, char** argv)
{
int* test = malloc(sizeof(int)*100);
for(int i = 0; i < 100; i = i + 5) {
*(test+i) = 15;
*(test+i+1) = 15;
*(test+i+2) = 15;
*(test+i+3) = 15;
*(test+i+4) = 15;
}
free(test);
}

However, when we look at the assembly generated by this code in source code
listing 4, we notice that the resulting binary size has increased.
In typical userspace applications, the increased image size usually is outweighed
by the decreased execution time; however, the performance of the Linux kernel is very
closely tied to its size in memory. A smaller kernel is a faster kernel, as a smaller
kernel image uses less memory, thus more memory is available for user applications.
Thus, this optimization might actually hurt kernel performance, rather than help it.

Listing 4 Loop Unrolled Version of Program 1

movl $0, -8(%ebp)


movl -8(%ebp), %eax
cmpl $100, %eax
jge ..B1.5
..B1.4:
movl -8(%ebp), %eax
movl -12(%ebp), %edx
movl $15, %ecx
movl %ecx, (%edx, %eax, 4)

movl -8(%ebp), %eax


movl -12(%ebp), %edx
movl %ecx, 4(%edx, %eax, 4)

movl -8(%ebp), %eax


movl -12(%ebp), %edx
movl %ecx, 8(%edx, %eax, 4)

movl -8(%ebp), %eax


movl -12(%ebp), %edx
movl %ecx, 12(%edx, %eax, 4)

movl -8(%ebp), %eax


movl -12(%ebp), %edx
movl %ecx, 16(%edx, %eax, 4)

addl $5, -8(%ebp)


movl -8(%ebp), %eax
cmpl $100, %eax
jl ..B1.4
..B1.5:

Often, developers do not perform these optimizations by hand, but rather allow the
compiler to perform these optimizations. For the Intel C++ Compiler, loop-unrolling
is activated with the command-line flag -funroll-all-loops. This optimization
is automatically activated by optimization level -O1 on x64 platforms, and by -O2 on
x86 platforms. The Intel Compiler also provides the flag -unroll-aggressive,
which allows the compiler to completely unroll certain loops with small trip counts,
as well as -unroll[=n], which allows the user to specify the maximum
number of times a loop can be unrolled.
In the previous assembly listings, all optimizations other than loop-unrolling
were disabled; however, optimizations such as loop-unrolling, used in conjunction with
other optimizations, such as parallelization, can provide even stronger benefits. In
source code listing 5, the Intel Compiler exploited the implicit parallelism of the loop
with the use of single instruction, multiple data (SIMD) instructions. SIMD instruc-
tions perform a single operation in parallel over multiple
chunks of data.
This example demonstrates how optimizations exist in a delicate balance, which
must be taken into consideration when determining which optimizations to employ.
Even if an optimization provides a significant performance increase on its own, its
benefit is nullified, or even reversed, if it is used in conjunction with other “optimizations”
which hinder it. To remedy this condition, vendors build predetermined optimization
levels into their compilers. Each level corresponds to a given amount of optimization.
These typical optimization levels activate a set of optimizations with one flag. Each of
the optimization levels targets a unique purpose. Optimization level -Os optimizes the
code for decreased image size. Optimization level -O0 disables all compiler optimizations.
Optimization levels -O1 through -O3 provide increasing degrees of compiler optimization,
-O1 being light optimizations and -O3 being aggressive optimizations. If no other
optimization flags are selected, -O2 is the default for the Intel C++ Compiler. Table 1.1
provides a detailed description of each level.

These blanket optimizations perform well in the general case, but when dealing
with highly specialized architectures, such as those found in mobile devices, more
consideration needs to be placed on specialized optimizations.

1.1 Diamondville
This work focuses on the Intel Diamondville platform, Intel’s first generation micro-
architecture designed for netbooks and mobile Internet devices.
The Diamondville platform includes the Intel Atom n270 processor, as well as the
945GME chipset and 82801DBM I/O controller [12]. The n270 is a 45nm fabricated
processor containing on-die a 32kB instruction cache, as well as a 24kB write-back
data cache. It operates at a steppable frequency of 1.60 GHz and a front-side bus speed
of 533MHz.

Listing 5 Intel Optimized Version of Source Code 1

movl %eax, %edx


andl $15, %edx
je ..B1.9
..B1.5:
testb $3, %dl
jne ..B1.19
..B1.6:
negl %edx
andl $16, %edx
shrl $2, %edx
xorl %ecx, %ecx
..B1.7:
movl $15, (%eax, %ecx,4)
incl %ecx
cmpl %edx, %ecx
jb ..B1.7
..B1.9:
movdqa _2il0floatpacket.28, %xmm0
movl %edx, %ecx
negl %ecx
andl $3, %ecx
negl %ecx
addl $100, %ecx
..B1.10:
movdqa %xmm0, (%eax, %edx,4)
addl $4, %edx
cmpl %ecx, %edx
jb ..B1.10
..B1.12:
cmpl $100, %ecx
jae ..B1.16
..B1.14:
movl $15, (%eax, %ecx,4)
incl %ecx
cmpl $100, %ecx
jb ..B1.14
_2il0floatpacket.28:
.long 0x0000000f, 0x0000000f, 0x0000000f, 0x0000000f
.type _2il0floatpacket.28, @object
.size _2il0floatpacket.28, 16

Table 1.1: Compiler Flags Set at Optimization Levels for the Linux Intel C Compiler

Optimization Level    Optimization Flags
-O0                   Disables all optimizations
-O1                   Global optimizations
-O2                   All optimizations from -O1, plus:
                        Constant Propagation
                        Copy Propagation
                        Dead-Code Elimination
                        Global Register Allocation
                        Loop Unrolling
                        Optimized Code Selection
                        Partial Redundancy Elimination
                        Strength Reduction/Induction Variable Simplification
                        Variable Renaming
                        Exception Handling Optimizations
                        Tail Recursions
                        Peephole Optimizations
                        Structure Assignment Lowering and Optimizations
                        Dead Store Eliminations
-O3                   All optimizations from -O2, plus:
                        Prefetching
                        Scalar Replacement
                        Loop and Memory Access Transformations
                        Branch Elimination
                        Cache Padding

As opposed to micro-architectures designed for desktops and servers, mobile-


oriented micro-architectures, such as those found in laptops, are often judged on
factors other than just raw performance, such as power consumption, heat genera-
tion and form factor. Consideration for these other metrics is visible in the design of
modern processors. For example, most current Intel Desktop processors, and some
newer AMD server processors use land grid array (LGA) sockets, as these provide a
more stable current to the processor, due to higher pin densities, and thus allow for
higher and more stable clock frequencies [10]. On the other hand, most Intel and
AMD laptop processors use micro flip chip ball grid array sockets (Micro-FCBGA),
as these provide a thinner physical profile, which is more suited for smaller devices,
and facilitates improved heat conduction.
This desire for reduced power consumption, heat generation, and form factor is
amplified with netbooks and mobile Internet devices, where users typically expect
increased mobility and battery-life in comparison to larger laptops which are run-
ning micro-architectures already optimized for reduced power consumption and heat
generation. This forces chip fabricators to make significant reductions in the proces-
sor footprint and to eliminate features that consume significant power, often hindering
performance.
The Intel Diamondville micro-architecture includes power-saving features such as
dynamic cache sizing and dynamic bus parking [12]. Dynamic cache sizing, in low
power states, flushes and disables chunks of L2 cache to conserve power. Dynamic
bus parking powers down the chipset while the processor is in low-frequency mode.
While these features have a minor impact on performance, this potential impact is
not nearly as great as the exclusion of an out-of-order instruction scheduler.
In-Order Instruction Scheduler.
Unlike other typical Intel processors, the n270 utilizes an in-order instruction
scheduler for decreased power consumption and reduced heat generation. As opposed
to an out-of-order instruction scheduler, which reorders instructions at the processor
level, an in-order instruction scheduler executes instructions in the same order as
they were input [25]. Due to this static instruction placement, instructions that are
placed in less-than-optimal positions can cause bubbles to form in the pipeline and
thus reduce throughput. These bubbles can be avoided by reordering the instructions
at compile time. However, the compiler must be aware of the type of instruction
scheduler on the target platform in order to accurately model the instruction
pipeline.
Consider the following example from Intel’s Atom optimization guide [20]. The
code in Listing 6, without careful optimization, can lead to a memory access depen-
dency stall. The assembly generated from Listing 6, shown in Listing 7, demonstrates
these dependency stalls. While the first movl instruction is executing, the next in-
struction, imull, cannot execute, due to the memory dependency on the value b in
register eax. This also occurs for the next move and multiply for d. Reordering the
instructions to match Listing 8 removes the dependency stalls, as both sets of movl
and imull instructions can be executed in parallel.

Listing 6 Example from Intel Atom Optimization guide [20].

a = b * 7;
c = d * 7;

Listing 7 Assembler Generated from Listing 6 with Memory Stall.

movl b, %eax
imull $7, %eax
movl %eax, a
movl d, %edx
imull $7, %edx
movl %edx, c

Listing 8 Optimized Assembler of Listing 6 to remove dependency stall

movl b, %eax
movl d, %edx
imull $7, %eax
imull $7, %edx
movl %eax, a
movl %edx, c

Movbe Instruction.
Due to the role of netbooks and mobile Internet devices, which often have to
interact with many different peripherals, efficient conversion between big-endian and
little-endian data, and vice versa, is important to compensate for the lack of high clock
speeds. One common situation which often requires endian conversions is networking,
where a machine may be required to communicate with a machine using a different
endianness. During high processor usage, this conversion can cause network transfers
to suffer. To remedy this, Intel added the movbe instruction, which performs a move
and byte swap, thus allowing for single-instruction conversions. This instruction can
also be used to increase the performance of certain arithmetic operations. Since the
Intel Atom processor is the only processor line in the x86 family that supports this
instruction, the Intel C compiler is the only compiler that will generate code utilizing
this instruction [12].
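
As a brief illustration (not drawn from the thesis experiments), the kind of conversion that movbe accelerates is the load-plus-byte-swap common in networking code. The function below is a hypothetical example; whether the compiler actually emits movbe for it depends on the target and compiler version.

#include <arpa/inet.h>
#include <stdint.h>

/* Read a 32-bit big-endian (network order) value and return it in host
 * order. On the Atom, a compiler that supports movbe may lower the load
 * and byte swap into a single instruction; elsewhere, a separate mov and
 * bswap are generated. */
uint32_t read_be32(const uint32_t *p)
{
    return ntohl(*p);
}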

1.2 Linux Kernel


While a system similar to the one implemented in this paper could optimize user space
applications rather than the operating system kernel, any performance gains achieved
would only be beneficial to that specific user space application, whereas optimizations
to kernel space are beneficial to all applications which interact with kernel space.
The Linux kernel is an open-source monolithic kernel initially created by Linus
Torvalds as a UNIX clone while Torvalds was a student at the University of Helsinki.
While initially targeting only the Intel 80386, Linux now supports many architectures
besides x86 including SPARC, MIPS, and ARM.
Monolithic kernels execute the entire operating system from one program run in
kernel mode, also known as kernel space or ring 0. To protect the system against
malicious behavior, the system is divided into security rings. Each ring represents
a series of restrictions, which are applied to all programs running in that ring. The
kernel runs in ring 0 because ring 0 provides full access to all processor instructions as
well as the hardware. Rings 1 and 2 are typically used for device drivers, and finally
ring 3 is used for user space. In ring 3, programs are forbidden from executing certain
processor instructions, as well as from performing I/O directly. This separation allows for increased security,
as an application which is exploited in user space can do significantly less damage
to the system than an exploitation in kernel space.

Figure 1.1: Kernel and User Space Interaction.

However, since user space ap-
plications often require access to the functionality controlled in kernel space, such as
memory management and I/O, they must interface with the kernel, which can then
perform the requested action. As shown in Figure 1.1, this interface is performed
through system calls [28].
By optimizing the kernel, we optimize the memory management, I/O and process
management of every process on the system.
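
As an illustrative sketch (not part of the thesis system), the following program shows a user space application crossing into kernel space through a system call; here write(2) is invoked directly via the C library's syscall wrapper.

#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    const char msg[] = "hello from user space\n";
    /* Ring 3 code cannot perform the I/O itself; the system call traps into
     * ring 0, where the kernel carries out the write on the program's behalf. */
    syscall(SYS_write, STDOUT_FILENO, msg, sizeof(msg) - 1);
    return 0;
}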

1.3 Intel C Compiler


Intel offers optimizing compilers for C, C++, and Fortran. These compilers provide
specific optimizations for Intel platforms, ranging from the server-class Xeon proces-
sors to the netbook-class Atom processors. The LinuxDNA project has managed to
patch the Linux kernel to build using the Intel C compiler, rather than the GNU C
compiler, thus allowing Linux users to take advantage of the Intel C Compiler’s specific
platform optimizations for Intel chipsets [9]. These platform-specific optimizations,
not offered by other compilers, provide great potential for increased performance and
decreased power consumption. Initial results from the LinuxDNA project found up
to a 40% increase in performance within the kernel [24]. However, no work has been
performed to discover the optimal compiler flags for the Intel Diamondville platform.
The most straightforward approach to solving this problem is to perform iterative
compilation, compiling each possible kernel and then selecting the best. However,
this is not a feasible approach due to the size of the search space. This system tests
107 different compiler flags, which would yield 2^107 different kernels. Therefore, it is
necessary to focus the search, without reducing the search space. This can be accom-
plished by applying a heuristic search technique, in this case a genetic algorithm, to
the search space.

1.4 Genetic Algorithms
Inspired by natural selection, genetic algorithms (GA) are an adaptive heuristic search
technique used for evolving solutions. The GA begins with the generation of a random
population, the individuals of which are encodings of solutions to the problem-set.
Each generation of the random population undergoes mutation, reproduction, selec-
tion, and fitness operators.
Fitness Operator.
The fitness operator evaluates each individual based on the quality of the solution
it represents. A fitness is then assigned to each individual which allows the individuals
to be compared. This allows the GA to encourage and nurture good solutions while
eliminating bad solutions.
Selection Operator.
After each individual is evaluated with the fitness operator, the selection operator
chooses which individuals will reproduce for the next generation of the GA. Typically,
selection operators foster elitism within the population to encourage better solutions.
There are multiple kinds of selection operators, including roulette-wheel selection,
tournament selection, and truncation selection [19].
The most straightforward type of selection operator is truncation. The truncation
selection operator orders the population by fitness and then chooses a percentage of
the most fit individuals to reproduce [19].
The tournament selection operator runs multiple tournaments, where multiple
random individuals are compared. Only the individual with the highest fitness from
each tournament, the tournament winner, is chosen for reproduction [19].
The roulette-wheel selection operator, also known as the fitness proportionate
selection operator, assigns a percentage to each individual, based on the percentage
of the individual’s fitness compared to the total fitness available in the population.
Then, based on these percentages, individuals are chosen at random for reproduction.
Unlike the truncation and tournament selection operators, which attempt to keep as
many high-fitness individuals as possible, the roulette-wheel selection operator runs
the risk of selecting low-fitness individuals over high-fitness individuals, depending on
the random values chosen [19].
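
A minimal sketch of roulette-wheel selection follows; it assumes non-negative fitness values and an array-based population, both of which are illustrative rather than taken from the implementation in Appendix A.

#include <stdlib.h>

/* Pick one of n individuals with probability proportional to its fitness. */
int roulette_select(const double *fitness, int n)
{
    double total = 0.0, running = 0.0, spin;
    int i;

    for (i = 0; i < n; i++)
        total += fitness[i];

    spin = ((double)rand() / RAND_MAX) * total;   /* random point on the wheel */
    for (i = 0; i < n; i++) {
        running += fitness[i];
        if (spin <= running)
            return i;
    }
    return n - 1;                                 /* guard against rounding error */
}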
Mutation Operator.
The mutation operator randomly mutates some of the individuals, so as to add
genetic diversity within the population. To mutate an individual, the mutation op-
erator makes a subtle change to one or more parts of the individual at random. For
example, for an individual representing a set of compiler flags, a mutation might
involve setting a flag that was previously on to off, or vice versa.
Reproduction Operator.
The reproduction operator combines two different individuals' encodings to pro-
duce a new individual with some characteristics of each of the original parent in-
dividuals. The overall goal of the reproduction operator is to combine the good
characteristics of each parent to produce a better child.

Figure 1.2: Flowchart For a Generic Genetic Algorithm

These operators are performed at each generation, with each generation producing
a local optimum [23]. Figure 1.2 demonstrates the typical flow of a GA.
Termination. There is no way of knowing in advance when a GA should terminate. Typically,
too few generations will not give the GA enough time to “evolve” a good solution.
Too many generations will waste time if the best solution found by the GA was found
early on.
Because genetic algorithms are a heuristic search technique, there is no
guarantee that the best solution found after N generations is indeed the optimal
solution; however, GAs are typically good at finding local optima, and at finding solutions
better than the solution available at the start of the GA.

1.5 Thesis Statement


To solve the problem previously described, this work implemented a genetic algorithm
to evolve compiler flags. The choice of a GA was motivated both by the large size of
the search space, which makes exhaustive iteration infeasible, and by previous work,
which showed that a GA is able to track which sets of optimizations perform well together.

On the Intel Diamondville platform, a Linux kernel built with compiler flags
evolved from a genetic algorithm will improve user space application performance
for some applications when compared to a Linux kernel built with the standard
build flags, because the genetic algorithm will adapt to how different optimiza-
tions interact amongst themselves.

1.6 Thesis Outline
The next chapter discusses the prior work that motivated this work. The third chapter
describes the implementation and design details of constructing the system. The
fourth chapter shows the results achieved, and finally the fifth chapter provides ideas
for future work.

Chapter 2

Prior Work

“Some men give up their designs when they have almost reached the goal; while
others, on the contrary, obtain a victory by exerting, at the last moment, more
vigorous efforts than ever before”
– Herodotus of Halicarnassus

2.1 ACOVEA
The Analysis of Compiler Options via Evolutionary Algorithm system (ACOVEA) is
a C++ framework that implements a genetic algorithm to find the “best” options for
compiling programs with the GNU Compiler Collection (GCC) C and C++ compilers
[16]. Versions of ACOVEA supporting the SPARC platform, as well as the
Intel C++ compiler, are currently in development.
Population Representation and Initial Population.
ACOVEA initially represented individuals as a binary string, with each compiler
flag represented as a bit in a long long data primitive, typically 8 bytes on a 32 bit
system, thus limiting the number of options considered to 64. This was then
changed to reference extensible markup language (XML) descriptions of the compiler
and its options, to reduce the complexity of dealing with compiler flags which offer
multiple states [16]. However, the added overhead of parsing XML motivated a shift
from XML to an object hierarchy for the final representation of the population.
The initial population is created at random; however, “blanket” optimizations,
described in Table 1.1, are added so that the random individuals must compete against
the typically chosen optimizations.
Following the biological model of African lions, ACOVEA attempts to run multi-
ple populations simultaneously. These populations, or prides, link together through
migration, where individuals relocate from one population to another. Over time, the
populations each approach unique genetic uniformity, with each population focusing
on locally optimal results. Then, as individuals migrate, these local optima are
spread and combined among the populations to improve these results [16].
Fitness Evaluation.
ACOVEA evaluates fitness by compiling the given application, which the user
wants optimized, using each individual’s corresponding compiler options. The result-
ing executable is run and then assigned a fitness score based on the resulting execution
time. Individuals that cause compiler or program failure receive a fitness score that
reduces the likelihood that such individuals are chosen for reproduction.
Reproduction and Mutation.
After each individual is assigned a fitness, a subset of the individuals is selected
for reproduction and mutation. In this system, two-point crossover, in which two indexes
are chosen within the parents and everything in between those indexes is
swapped, is used for reproduction.
Mutation, on the other hand, is performed either by switching a compiler op-
tion to on from off, or vice versa, or by changing the state of a compiler option
which can exist in multiple states. An example of the former would be changing
-fno-unroll-loops to -funroll-loops. An example of the latter would be
changing -fp-model=strict to -fp-model=fast. The first example enables
loop-unrolling, which was previously disabled. The second example changes the float-
ing point model from strict to fast.
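
The sketch below illustrates two-point crossover over flag bit strings as described above; it operates at byte granularity for brevity, and the buffer layout is an assumption rather than ACOVEA's actual data structures.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Copy parent 1 into the child, then splice in the bytes of parent 2 that
 * fall between two randomly chosen cut points. */
void two_point_crossover(const uint8_t *p1, const uint8_t *p2,
                         uint8_t *child, size_t len)
{
    size_t a = rand() % len;
    size_t b = rand() % len;

    if (a > b) {
        size_t t = a;
        a = b;
        b = t;
    }

    memcpy(child, p1, len);
    memcpy(child + a, p2 + a, b - a);
}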
Results.
ACOVEA has seen significant success, in some cases improving run time perfor-
mance by up to 57% [16]. While this system seems like a good solution to the problem
mentioned earlier, ACOVEA focuses more on small algorithmic benchmarks, rather
than on full systems. Therefore, it would not be an ideal system to use for optimiz-
ing the Linux kernel; however, it does reinforce the successful application of genetic
algorithms in optimization-based scenarios.

2.2 MILEPOST GCC


MILEPOST GCC is an attempt to create an adaptive compiler, which does not require
the user to select optimizations, but rather will tune the optimizations automatically
to the given platform, through a plugin to the GNU C Compiler [7]. Unlike
typical static compilers, which simply perform selected optimizations, an adaptive
compiler searches the space of all these optimizations to find the combination that
provides the best results. Using this technique, MILEPOST hopes to eliminate the
need to hand-write optimizing compilers for each hardware architecture [7].
Unlike a genetic algorithm, MILEPOST GCC operates in two distinct phases,
training and deployment.
Training.
The first phase trains the compiler, by allowing it to experiment with various
optimization strategies on the target platform. To perform this training, the compiler
must both perform static analysis on program features, as well as collect a sampling of
results from some of the optimization strategies, including metrics such as execution
time, and code size. The results are stored in a database for the next step [7].
Deployment.
From the results stored in the training phase, a model is created, which can predict

the results of a set of optimization strategies without actually compiling and running
the application, based on the program features in the code to be optimized. During
this phase, the actual application the user wishes to optimize is input to the compiler
and the optimization strategies determined effective on the training applications are
used [7].
Results.
Tested on multiple platforms, and on multiple popular open source projects such
as BerkeleyDB and Mozilla, MILEPOST GCC has seen significant success in practical
use. For example, on the Intel Xeon processor, MILEPOST GCC achieved a 140%
increase in performance, without a significant increase in code size [5].

2.3 Cooper et al.


Cooper et al. implemented a genetic algorithm to evolve compiler optimization or-
derings to reduce code size [4].
Population.
Each member of the population was represented by a fixed-length string of length
12, representing a sequence of 12 optimizations, and thus a search space of 10^12
possibilities. A population size of 20 was used, as experiments with larger populations
did not appear to produce better results [4].
Fitness and Selection.
To assign each individual a fitness score, the application being optimized is compiled and
then optimized using the optimization sequence the individual represents. The code is
then cleaned of “empty basic blocks” and Briggs style register allocation is performed
with 32 integer and 32 floating-point registers. The number of static operations in
the optimized code is then assigned as the fitness score.
After each individual is assigned a fitness score, the worst performing individual,
along with three other random individuals from the lower half of the population, is
removed [4].
Reproduction and Mutation.
Each generation, reproduction is performed twice to create 4 new individuals to
fill the vacancies created by the selection operator. Reproduction is performed by
combining the first half of one parent with the lower half of another parent. The
parents are chosen from the top half of the population.
Only 15 individuals in the population are eligible for mutation each generation.
The individual with the highest fitness, as well as the 4 newly created children during
reproduction, carry immunity from mutation. Individuals in the top half of the
population have a 5% chance of mutation, and individuals in the lower half of the
population have a 10% chance of mutation. If selected for mutation, one of the individual’s
orderings is replaced with a randomly chosen phase.
During these phases, any duplicate individual is removed, and replaced with a
random individual [4].
Results.

The genetic algorithm is run for 1000 generations. Over these 1000 generations,
code up to 40% smaller resulted from employing the sequence evolved by the genetic
algorithm [4].

2.4 COLE
The Compiler Optimization Level Exploration (COLE) system claims to be the first
multi-objective evolutionary algorithm, inspired by a genetic algorithm, designed to
find Pareto optimal optimization levels. While most systems optimize for either
reduced execution time or reduced space, COLE allows for multi-objective fitness.
This allows for the system to optimize for both reduced execution time and code
size. Due to these multiple fitness constraints, COLE evaluates fitness based on
Pareto optimality. Pareto optimality refers to performing better for at least one
objective, while performing at least as well for the other optimizations for the other
objectives. The COLE system outperformed both GCC’s standard optimization levels
and random search techniques when applied to the SPEC CPU2000 benchmarks [8].

2.5 Davidson et al.


Davidson et al. implemented a heuristic search technique to detect the optimal or-
dering of compiler optimization phases. First, the search space is pruned by detecting
and removing dormant phases, identical function instances, and equivalent function
instances. Then, the dynamic performance of each remaining possibility is evaluated
to determine the optimal solution. As a result, on the ARM platform, using the VPO
compiler, 86% of cases reached an optimal state [15].

Chapter 3

Design and Implementation

“Simple things should be simple and complex things should be possible”


– Alan Kay

The system described consists of 4 main segments. The first segment is the ap-
plication profiler, which determines how fitness will be evaluated within the genetic
algorithm. The second segment is the genetic algorithm, which drives the entire sys-
tem. The third segment is the build farm, which allows for distribution of the kernel
building workload. Finally, the fourth segment is the testing farm, which provides
distribution of the kernel evaluation workload.

3.1 The Profiler

Figure 3.1: Results From Running the Profiler.

To understand how kernel performance impacts user space performance, we take


a close look at the interactions between user space and kernel space. As discussed
in Section 1.2, system calls are the means by which user space interacts with kernel
space; thus, by monitoring system call usage in user space, interaction with the kernel can
be better understood. As shown in Algorithm 1, we can monitor system call usage
using ptrace. Ptrace allows a parent process to trace through a child process,
waking the parent for events such as the child exiting, or the child invoking a system
call.
As the parent process is awoken, a context switch is performed and thus all of
the registers, containing the information regarding the system call invoked by the
child process, are saved by the kernel. The parent process must then examine this
saved register state to recover the system call information. System calls on x86 platforms
can be performed in three ways, defined in /arch/x86/ia32/ia32entry.S. The first method utilizes
the SYSENTER instruction. In this method, the system call number is placed in
the accumulator register, EAX, and then the five arguments are placed in EBX, ECX,
EDX, ESI, and EBP registers respectively. The second method utilizes the SYSCALL
instruction. In this method, the system call number is placed in the EAX register, and
then the five arguments are placed in the EBX, EBP, EDX, ESI, and EDI registers
respectively. Finally, the third method uses software interrupt 128. In this method,
the system call number is stored in the EAX register, and then the six arguments are
placed in EBX, ECX, EDX, ESI, EDI, and EBP registers respectively. Since we are
only concerned with the system call that is being executed, and not the parameters
to that system call, we only need to retrieve the value of the EAX register to know
exactly which system call was performed. Thus, using ptrace for each executable,
which the user wishes to optimize, allows us to generate a total count of each system
call used.
Running the profiler with various profiles demonstrates an interesting trend. Fig-
ure 3.1 is an example profile created by running the UNIX commands du, mkdir,
and ls. Most system calls were not invoked at all. Some were executed once or
twice, and some were executed significantly more often. The table in Figure
3.1 contains the 6 most invoked system calls.
Profiling the applications to be targeted allows for creating a customized fitness
metric, which can then be used to evaluate fitness. By using the most prevalent
system calls to test the fitness of each kernel, the system optimizes for the kernel space
operations used by the targeted applications.
The source for the profiler can be found in Section A.1.

Algorithm 1
Procedure Profile(T)
(∗ Keeps Track of System Call Usage ∗)
Input: Set of Applications to be Profiled T
Output: Total Count of Each System Call Occurrence
1. for i ← 0 to Total Number of System Calls
2.   Counts_i ← 0
3. for i ← 0 to |T|
4.   do Create New Process
5.     if New Process
6.       Ptrace System Call with PTRACE_TRACEME
7.       Execute T_i
8.     else
9.       while true
10.        do Wait for Child
11.          if Child Exited
12.            stop
13.          else
14.            Obtain Child Registers
15.            tmp ← EAX
16.            Counts_tmp ← Counts_tmp + 1
17.            Allow Child to Continue
18. return Counts
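
For concreteness, the following is a minimal sketch of the profiling loop in Algorithm 1 for 32-bit x86, where orig_eax in the traced child's saved registers holds the system call number. The complete profiler appears in Section A.1; the constants and error handling here are simplified assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

#define NR_SYSCALLS 512

int main(int argc, char **argv)
{
    static long counts[NR_SYSCALLS];
    int in_syscall = 0;
    pid_t child;

    if (argc < 2)
        return 1;

    child = fork();
    if (child == 0) {
        /* Child: request tracing, then run the program being profiled. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        _exit(1);
    }

    for (;;) {
        int status;
        struct user_regs_struct regs;

        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;

        /* On 32-bit x86, orig_eax holds the system call number (EAX). */
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if (!in_syscall && regs.orig_eax >= 0 && regs.orig_eax < NR_SYSCALLS)
            counts[regs.orig_eax]++;
        in_syscall = !in_syscall;   /* ptrace stops at both entry and exit */

        /* Resume the child until its next system call boundary. */
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);
    }

    for (int i = 0; i < NR_SYSCALLS; i++)
        if (counts[i])
            printf("syscall %3d: %ld\n", i, counts[i]);
    return 0;
}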

3.2 The Genetic Algorithm


This section discusses the design decisions behind the genetic algorithm. The actual
implementation can be found in Section A.2.

3.2.1 Population Representation


As seen in Section 2.1, there are many possible methods for representing individ-
uals within the population. ACOVEA, for example, initially chose binary strings,
then moved to extensible markup language (XML) files, and finally chose an object-
oriented hierarchy. Each of these representations provides trade-offs that must be
considered when designing the population representation [16].
The first representation considered is binary strings. One advantage of binary
strings is that they are extremely compact. For n flags, only ⌈n/b⌉ bytes, where there
are b bits in a byte, are required. Aside from being compact, binary strings also
yield well to caching. On a 32 bit architecture, typically cache lines are at least the
size of an integer, so at the very minimum 32 bits are stored in cache, thus allowing
fast access to 32 options, without requiring expensive memory access. Optimizations
such as prefetching, in which the operating system preloads pages into memory in
anticipation of use, can maximize this effect, as the entire string can typically reside
within the cache.
The second representation considered is XML. One of the largest benefits of using
XML files is persistent storage, thus allowing for the genetic algorithm to be paused
and continued dynamically during execution. This is very advantageous for a system
that is time-consuming to execute. Another advantage of XML is that it is more
accommodating to compiler flags which can exist in multiple states. For example,
with binary strings, each flag is either on or off, yet with XML, flags can exist in mul-
tiple states, such as -fp-model=strict or -fp-model=fast. This functionality
comes at the cost of a significant performance overhead, especially when compared
to the performance of binary strings. While serialization typically is not an expen-
sive operation, deserialization can be very expensive. At the same time, care must
be taken to ensure that the XML files stored on the disk, and the objects stored in
memory are kept in sync to prevent data loss.
The final representation considered is an object-oriented hierarchy. While an
object hierarchy allows for data structure abstraction and increased extensibility, the
object-oriented hierarchy representation can also incur significant overhead compared
to the binary string representation. Expensive inheritance features, such as virtual
functions, which allow for dynamic dispatch but also force virtual function addresses
to be referenced in a vtable, slow down operations that would normally take only
a few instructions for a binary string.
Both the XML and object-oriented representations could implement the binary
string representation internally, however they both add additional overhead in provid-
ing more functionality. The representation chosen in this work was that of the binary
string, incorporating some of the benefits provided by the other representations, while
attempting to significantly reduce the overhead which accompanied those features.
Achieving data persistence without the performance penalties of XML relies on
the POSIX mmap system call. The mmap system call allows for a file to be mapped
directly into memory. Replacing the calls to malloc with mmap when initializing the
population allows the system to allocate memory which is backed by a file. Thus,
as the population is created and evolved, the file is updated periodically, or upon
request, with the msync system call, without a parser, which is required for XML,
or any expensive serialization and deserialization. While implementing the ability to
pause and continue the genetic algorithm during execution exceeds the focus of this
work, more information can be found in Section 5.3.1.
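
A minimal sketch of this file-backed allocation follows; the path handling and sizing are illustrative assumptions, not the code from Section A.2.

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a file into memory so that the population it holds persists on disk.
 * MAP_SHARED keeps the mapping and the backing file in sync; msync can be
 * called after each generation to flush changes on demand. */
void *map_population(const char *path, size_t bytes)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    void *pop;

    if (fd < 0)
        return NULL;
    if (ftruncate(fd, bytes) < 0) {      /* reserve space for the population */
        close(fd);
        return NULL;
    }

    pop = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                           /* the mapping keeps the file usable */
    return pop == MAP_FAILED ? NULL : pop;
}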
To provide support for compiler options that can exist in multiple states, an
overflow buffer can be allocated to each individual. This overflow buffer, whose size is
chosen based on the number of options that can exist in multiple states, contains the
possible states for each compiler flag. For example, a flag which can exist in multiple
states such as -O, which can exist in states 0, 1, 2, 3, and s, would be stored in an
overflow buffer of size 5. The first element of the overflow buffer would be -O0, the
second would be -O1, and so on.
Consider the following example, representing the compiler flags and the population
shown in Figure 3.2. Individual 0, which has a value of 0x03, is a composite of
both 0x01 and 0x02, thus this individual would represent both -axSSE3 and -vec
optimizations active. The overflow buffer for individual 0 is empty, as both of those
active flags are binary in nature. On the other hand, individual 1 has a value of
0x04, thus representing only the -fp-model compiler flag. This flag can exist in
multiple states, either -fp-model=strict or -fp-model=fast, and thus further
clarification is required for this individual. For this clarification, the overflow buffer is
referenced, which contains the specific state of the flag. For individual 1, the overflow
buffer contains -fp-model=fast, and thus this state of the optimization is used.
Individual 2, on the other hand, whose binary string is equivalent to that of individual 1,
represents -fp-model=strict.

Flag                Byte #        Individual    Value    Overflow
-axSSE3             0x01          0             0x03
-vec                0x02          1             0x04     -fp-model=fast
-fp-model=strict    0x04          2             0x07     -fp-model=strict
-fp-model=fast      0x04

Figure 3.2: Example Population.
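
The sketch below shows one way the binary-string representation and overflow buffer described above could be laid out in C; the names and sizes are illustrative and do not come from the implementation in Section A.2.

#include <stdint.h>

#define NUM_FLAGS      107                     /* compiler options under test  */
#define FLAG_BYTES     ((NUM_FLAGS + 7) / 8)   /* one bit per binary option    */
#define OVERFLOW_SLOTS 8                       /* options with several states  */

struct individual {
    double      fitness;                       /* -1 marks "not yet evaluated" */
    uint8_t     flags[FLAG_BYTES];             /* the binary string itself     */
    const char *overflow[OVERFLOW_SLOTS];      /* e.g. "-fp-model=fast"        */
};

/* Test whether compiler option i is active for this individual. */
static inline int flag_enabled(const struct individual *ind, int i)
{
    return (ind->flags[i / 8] >> (i % 8)) & 1;
}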

3.2.2 Compiler Option Initialization


During the compiler option initialization phase, the optimizations which the compiler
supports are mapped to corresponding bits, which can then be used to generate and
interpret the population.
The most robust way to determine these options is to invoke a compiler flag
which lists all options and their corresponding descriptions and states, assuming the
compiler supports such a flag. For the Intel C Compiler, this can be achieved by
passing the C compiler the -help flag [9]. Once the compiler outputs all supported
options, tools that provide lexical analysis, such as yacc and lex, allow for these
options to be parsed automatically. This methodology allows for use with almost any
compiler without program modification, however it currently exceeds the scope of this
work, which focuses on the Intel C compiler. For more information see Section 5.3.2.
Since the scope of this work is the Intel C Compiler, a file was created containing a
fairly simple grammar. Each line is either a compiler option or of the syntax “EITHER
n”, where the next n lines are compiler options which can not be used in tandem.
While it is possible to instead just assign each option a flag, and not limit which flags
are used together, this discriminates against flags which the compiler will override.
For example, if both -axSSE2 and -axSSE3 are active, the compiler will override the
option -axSSE2 with the more powerful -axSSE3 option, thus potentially preventing
the proper evaluation of the -axSSE2 option.
Algorithm 2 demonstrates how the compiler options are initialized. This must be
performed first due to the other steps’ dependencies on the option count, which is
only known once the compiler options are known. At Line 6, if multiple states exist
for the option, the flag is set as NULL, thus specifying that multiple options exist in
the overflow buffer, which exists not only in the population representation, but also
in the compiler option representation.

Algorithm 2
Procedure parse_options(F)
(∗ Parses Option File into Memory for Use in Population ∗)
Input: File Path F
Output: Compiler Options to Create Population
1. Open F for Reading
2. i ← 0
3. while Not End of File F
4.   do line ← Read from F
5.     if line = “EITHER”
6.       then flag_i ← NULL
7.         n ← Read from F
8.         counts_i ← n
9.         for j ← 0 to n
10.          tmp ← Read from F
11.          Add tmp to Overflow Buffer
12.      else
13.        flag_i ← line
14.        counts_i ← 1
15.    i ← i + 1
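
A sketch of this parsing step is given below; the fixed-size tables and token-based reading are simplifying assumptions, and the actual parser is listed in Section A.2.

#include <stdio.h>
#include <string.h>

#define MAX_OPTS 256
#define MAX_LEN  64

static char flags[MAX_OPTS][MAX_LEN];      /* "" marks a multi-state slot  */
static char overflow[MAX_OPTS][MAX_LEN];   /* states for multi-state slots */
static int  counts[MAX_OPTS];

int parse_options(const char *path)
{
    FILE *f = fopen(path, "r");
    char line[MAX_LEN];
    int i = 0, n, used = 0;

    if (!f)
        return -1;

    while (i < MAX_OPTS && fscanf(f, "%63s", line) == 1) {
        if (strcmp(line, "EITHER") == 0 && fscanf(f, "%d", &n) == 1) {
            flags[i][0] = '\0';            /* NULL-style marker: see overflow */
            counts[i] = n;
            for (int j = 0; j < n && used < MAX_OPTS; j++)
                fscanf(f, "%63s", overflow[used++]);
        } else {
            strcpy(flags[i], line);        /* ordinary binary on/off option   */
            counts[i] = 1;
        }
        i++;
    }
    fclose(f);
    return i;                              /* total number of options parsed  */
}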

3.2.3 Initial Population


Once the number of compiler options is known, it is time to generate the initial
population. One of the important concerns when generating the initial population is
how the random numbers are generated. Incorrectly generating random numbers can
lead to bias within the initial population, and further on within the genetic algorithm,
thus tainting the results.
The most basic method of generating random numbers would be to invoke the
rand function defined in the ANSI C standard. One advantage of using rand is
that the random numbers are reproducible, provided the same seed is used. While
the actual random algorithm used in rand is implementation defined by the ANSI
standard, due to this work’s focus on the Linux kernel, the glibc implementation is
compared here. The glibc rand implementation uses a linear congruential generator
algorithm [26]. As recommended by Derrick H. Lehmer, the mathematician who
proposed the linear congruential algorithm, glibc rand chooses a modulo value of
2^31 − 1 [21]. This choice introduces non-linearity, thus reducing the holistic effect of
the seed on the random numbers produced. However, due to this algorithmic choice,
the linear correlation between the random numbers produced makes rand ill-suited
for applications such as simulations and cryptography [26].
Another method of generating random numbers in UNIX-based operating systems
is through the special file /dev/random, and the non-blocking version /dev/urandom.
/dev/random was designed, using SHA1 hashes rather than traditional ciphers, to
provide true random numbers which could be used in security contexts. Importantly,
/dev/random monitors the amount of noise in the entropy pool, and will block
until additional noise is gathered, rather than returning cryptographically vulnerable
bytes. Since this work does not require cryptographically secure random numbers,
/dev/urandom, which will not block, but rather return possibly vulnerable bytes,
is used.
Many other pseudo-random number generating algorithms exist, for example Knuth’s

subtractive random number generator. While analysing the effect of these different
algorithms is beyond the scope of this work, see Section 5.3.3 for more information.
Algorithm 3 demonstrates how each member of the population is initialized. Take
note that on Line 3, the initial fitness is set to −1.

Algorithm 3
Procedure init_population(T, X)
(∗ Creates an Initial Population ∗)
Input: Population Size T, Option Count X
Output: Initialized Population Ready for Evolution
1. for i ← 0 to T − 1
2.   do
3.     fitness_i ← −1
4.     flags_i ← Read X/8 + 1 bytes from /dev/urandom
5. return 0
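
A compact sketch of Algorithm 3 follows; the structure layout mirrors the earlier sketch (with the overflow buffer omitted), the byte count follows line 4 of the algorithm, and all names are illustrative.

#include <stdio.h>

#define FLAG_BYTES 14   /* 107 options / 8 bits, rounded up (illustrative) */

struct individual {
    double fitness;
    unsigned char flags[FLAG_BYTES];
};

int init_population(struct individual *pop, int pop_size, int option_count)
{
    size_t bytes = (size_t)option_count / 8 + 1;  /* as in Algorithm 3, line 4 */
    FILE *urandom = fopen("/dev/urandom", "r");   /* non-blocking randomness   */

    if (!urandom || bytes > FLAG_BYTES) {
        if (urandom)
            fclose(urandom);
        return -1;
    }
    for (int i = 0; i < pop_size; i++) {
        pop[i].fitness = -1;                      /* marks "needs evaluation"  */
        if (fread(pop[i].flags, 1, bytes, urandom) != bytes) {
            fclose(urandom);
            return -1;
        }
    }
    fclose(urandom);
    return 0;
}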

3.2.4 Fitness
Fitness evaluation is the most expensive operator in the genetic algorithm. As seen
in Subsection 3.2.3, each individual was initially assigned a fitness of −1. If any of
the individuals are modified, either by mutation or reproduction, their fitness is reset
back to −1, thus indicating their fitness needs to be reevaluated. This obviates dupli-
cate fitness evaluations between generations. Further evaluations can be avoided by
employing a global lookup table, containing the fitnesses of previously tested kernels.
This lookup table, especially if persisted in between runs of the genetic algorithm,
could greatly reduce the fitness operator’s time overhead. For more information, see
Section 5.3.4.
The first step of determining the fitness of each individual in the population is
to construct their corresponding compiler flags, thus allowing their optimization set
to build and test a kernel. To accomplish this, the individual’s bits are iterated
to determine which optimizations are active. If a flag is found to be active, the
corresponding compiler flag is concatenated onto the individual's compiler flags.
Rather than searching for which optimization corresponds to each bit, the compiler
flags are stored in such a way that the required bit shift for the current byte plus
the total iteration counter is equivalent to the optimization's index in the storage
array. Thus, after a bit is found to be active, looking up the corresponding string
is O(1).
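
The sketch below illustrates this O(1) lookup: the bit index of an active flag is used directly as the index into the flag table. The function and table names are assumptions, not the code from Section A.2.

#include <stdint.h>
#include <string.h>

/* Append the compiler flag string of every active bit to `out`. The flag
 * table is assumed to be ordered so that bit index i maps directly to the
 * i-th option, making each lookup constant-time. */
void build_flag_string(const uint8_t *bits, int num_flags,
                       const char *const *flag_table,
                       char *out, size_t out_len)
{
    out[0] = '\0';
    for (int i = 0; i < num_flags; i++) {
        if ((bits[i / 8] >> (i % 8)) & 1) {
            strncat(out, flag_table[i], out_len - strlen(out) - 1);
            strncat(out, " ", out_len - strlen(out) - 1);
        }
    }
}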
Once the flags are built, the workload of building the kernels is divided between
the machines within the build farm. First, the genetic algorithm chooses the next
individual for fitness evaluation and then begins to build a kernel with the corresponding flags.
For n machines in the build farm, the next n individuals’ corresponding flags are sent
to those machines, using the transmission control protocol (TCP). TCP was chosen
as it provides reliable data transfers, with safeguards against data-loss and data-
corruption. By capitalizing on the inherently embarrassingly parallel task of building

separate kernels, the overall duration of the genetic algorithm can be drastically
reduced.
To actually build the kernels, a child process is created for each machine in
the build farm. To create these processes, the POSIX standard fork, execl and
waitpid system calls are used. The fork system call creates a new child process
which inherits an exact copy of the memory image of the parent process. The execl
system call, then replaces that memory image, thus executing a different executable,
in the specified environment. Finally, the waitpid system call allows for a process
to block until another process finishes, thus allowing process synchronization. While
fork technically requires that the entire address space of the parent is copied for the
child, the copy-on-write optimization used within the Linux kernel allows the child to
access the parent’s address space until the child attempts to write changes to that ad-
dress space, at which point the parent’s address space is copied into a unique address
space for the child to modify. Since execl is called immediately after the child is
forked when child processes are created to execute kernel build tasks, no changes are
made to the parent’s address space, and thus no address space copy occurs, leaving
only the penalty of duplicating the parent’s page tables for the child. Before this
optimization was implemented in the kernel, vfork was used to achieve similar re-
sults. When initially forking the children to build the kernels and interact with the
build farm, each child does alter the parent’s address space, and thus incurs the copy
penalty; however, the parent’s address space can be pruned with the madvise system
call before the invocation of fork.
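
The following sketch shows the fork/exec/waitpid pattern used to run one kernel build step, here passing HOSTCFLAGS as a single make variable assignment; the paths, helper name, and error handling are simplified assumptions rather than the code in Section A.2.

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run `make <target> HOSTCFLAGS=...` in the kernel tree and block until the
 * step finishes, so that later build rules never violate a dependency. */
static int run_make(const char *kernel_dir, const char *target,
                    const char *hostcflags)
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: the copied image is immediately replaced by make, so the
         * parent's copy-on-write pages are never actually duplicated. */
        if (chdir(kernel_dir) == 0)
            execlp("make", "make", target, hostcflags, (char *)NULL);
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);            /* parent blocks until completion */
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

/* Example: run_make("/usr/src/linux", "bzImage", "HOSTCFLAGS=-O2 -vec"); */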
Each child sends the corresponding compiler flags to the build farm, and then
awaits a response containing the fitness score. The parent process, rather than sending
the compiler flags to a build farm, compiles the kernel locally. To build the kernel, the
parent process forks a child to perform each kernel build task, and then synchronizes
the two processes using waitpid, so that in cases where a job depends upon the
previous step, no dependency violations occur. Algorithm 4 demonstrates the fitness
evaluation function. At Line 1, a child process is created for each machine within the
builder server. While this is performed, the machine running the genetic algorithm
also builds the kernel, using the build rules, clean, oldconfig, bzImage, and then
creates an archive of the necessary files to install the produced kernel. The first build
rule executed is clean. This rule deletes the files created from the last kernel build.
This reduces the chances of cross-contamination, the kernel using already generated
files from the last compilation. After this, the build rule oldconfig updates the
.config file, which contains the kernel’s configuration and settings. This process
generates some files which are necessary for the kernel image to build properly.

24
Algorithm 4
Procedure fitness(P )
(∗ Evaluates the Fitness of Each Individual ∗)
Input: Population with Some Individuals Requiring Fitness Evaluation
Output: Population with New Fitness Values
1. for i ←0 to Number of Build Machines − 1
2. do
3. Fork Child
4. if Pid of Child
5. then
6. Send Individual Flag to Builder
7. Await Fitness Response
8. Fork Child
9. if Pid of Child
10. then
11. make clean
12. Waitpid Child
13. Fork Child
14. if Pid of Child
15. then
16. make oldconfig
17. Waitpid Child
18. Fork Child
19. if Pid of Child
20. then
21. make bzImage
22. Waitpid Child
23. Fork Child
24. if Pid of Child
25. then
26. Create tar archive of bzImage and System.map
27. Waitpid Child
28. Wait To Sequentially Receive Fitness for Each Kernel Created by Children
29. Send Kernel Archive Created by Parent to Testing Farm
30. Wait To Receive Fitness For Parent Kernel

Once this task has completed, the actual compressed kernel image is built. The
build rule bzImage is invoked with the flags HOSTCFLAGS and HOSTCXXFLAGS set
to the individual's corresponding compiler options, AR set to xiar, the Intel
archiver, and LD set to xild, the Intel linker. This process constructs the
compressed kernel image, which is then ready for installation on the target system.
Typically, the next step would be to build the kernel modules; however, since the
modules are never modified, compiling them at every iteration wastes valuable time.
Therefore, the kernel modules are separately built and installed on the testing farm
machines before the genetic algorithm is run.
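Expressed through execl, the invocation described above looks roughly like the sketch below; it is intended to run inside a forked child as in the earlier sketch, and the source tree path is a placeholder:

#include <unistd.h>

/* Build the compressed kernel image with the individual's evolved options.
 * The flag strings are passed as make variables, and the Intel archiver and
 * linker replace the GNU defaults.  Expected to be called from a forked
 * child, since execl() does not return on success. */
static void build_bzimage(const char* hostcflags, const char* hostcxxflags)
{
        /* hostcflags is expected to look like "HOSTCFLAGS=-axSSE3 ...",
         * and hostcxxflags like "HOSTCXXFLAGS=-axSSE3 ...". */
        if (chdir("/path/to/linux-2.6.30.5") != 0)      /* placeholder source tree */
                _exit(127);

        execl("/usr/bin/make", "make", "-j4",
              "AR=xiar", "LD=xild",
              hostcflags, hostcxxflags,
              "bzImage", NULL);
        _exit(127);                                     /* only reached if execl fails */
}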
Now that the kernel is built, it is ready for installation. Typically, to install the
kernel, the build rule install is invoked; however, this requires archiving, and ideally
compressing, the entire kernel source in preparation for network transfer. Regardless
of the compression algorithm used, including bzip2, gzip, lzop, and lzma, this task
is very time-consuming and can take up to 30 minutes. To remedy this, only the
necessary parts of the kernel are archived: the compressed kernel image, located at
arch/x86/boot/bzImage, and the System.map, located in the root source direc-
tory. At boot, the compressed kernel image is decompressed and executed. Because
the structure of each kernel build differs, the address of every function in the kernel
is recorded in the System.map file. By targeting only these two files, the process of
preparing the kernel for installation is reduced to just under 1 minute. Combined
with the parallelism of GNU Make, enabled through the -j n flag so that n jobs
execute in tandem, the entire build and preparation process takes approximately
6 minutes, depending on disk latency, processor speed, and the selected compiler
optimizations.

3.2.5 Reproduction, Mutation, and Selection


Reproduction occurs by taking the higher bits of the first parent and the lower bits
of the second parent and combining them with a bitwise or operator. Mutation
occurs by reading a random byte from /dev/urandom and combining it, again with
a bitwise or operator, with a random byte of the chosen individual. There is a 2%
chance of mutation occurring. Originally, a 10% chance of mutation was intended;
however, according to the results of ACOVEA, high levels of mutation, such as those
above 2%, conceal evolution [16]. The selection operator currently uses truncation
to determine which inhabitants should reproduce and which ones should die off,
choosing only the top 75% of the population to continue to the next generation.
This elitism supports strong kernel builds at each stage of the algorithm, so that
beneficial optimizations are passed down into the weaker kernel builds.
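Reduced to a single byte per individual for clarity, the two operators look roughly as follows (the full implementation in Appendix A.2 applies the same idea across a multi-byte bit string):

#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Crossover: keep the high bits of parent a and the low bits of parent b,
 * then join the two halves with a bitwise OR. */
unsigned char crossover(unsigned char a, unsigned char b)
{
        return (unsigned char)((a & 0xF0) | (b & 0x0F));
}

/* Mutation: with a 2% chance, OR a random byte from /dev/urandom into the
 * individual's bit string, which can only switch further optimizations on. */
void mutate(unsigned char* member)
{
        if (rand() % 100 < 2) {
                unsigned char mask;
                int fd = open("/dev/urandom", O_RDONLY);

                if (fd != -1 && read(fd, &mask, 1) == 1)
                        *member |= mask;
                if (fd != -1)
                        close(fd);
        }
}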

3.3 Build Farm


The only purpose of the build farm is to build kernels, capitalizing on the inherent
parallelism of building multiple unique kernels. Each machine in the build farm listens
on a designated port to receive kernel flags. Once flags are received, the kernel is built
following the same steps used by the parent process in Algorithm 4, except that the
build farm does not wait for fitness scores. Instead, after the kernel is built, archived,
and sent to the testing farm for fitness evaluation, the build farm returns to the
listening state. There are two primary reasons for not receiving the fitness score at
the build farm. First, because the testing farm reports the fitness directly to the
genetic algorithm, bypassing the build farm requires fewer network transfers and
thus introduces less delay from network latency. Second, by immediately returning
to a listening state, more kernels can begin to build while the previous set of kernels
is still under evaluation.
The implementation of the build farm can be found in Section A.3.
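As a simplified sketch of that listening loop (build_and_send_kernel() is a hypothetical stand-in for the build and archive steps described earlier, and the port matches the one used in the appendix):

#include <arpa/inet.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Skeleton of a build-farm node: accept a flag string, build a kernel with
 * it, ship the result to the testing farm, and immediately listen again. */
void builder_loop(void (*build_and_send_kernel)(const char* flags))
{
        struct sockaddr_in addr;
        int listener = socket(PF_INET, SOCK_STREAM, 0);

        memset(&addr, 0, sizeof(addr));
        addr.sin_family = AF_INET;
        addr.sin_port = htons(12345);
        addr.sin_addr.s_addr = INADDR_ANY;
        bind(listener, (struct sockaddr*)&addr, sizeof(addr));
        listen(listener, 16);

        for (;;) {
                char flags[8192] = {0};
                int conn = accept(listener, NULL, NULL);

                recv(conn, flags, sizeof(flags) - 1, 0);
                close(conn);

                build_and_send_kernel(flags);   /* no fitness score is read here */
        }
}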

3.4 Testing Farm


The testing farm consists of machines that will benefit from the optimized kernel
produced by the genetic algorithm. The purpose of these machines is to perform
micro-benchmarks with each individual’s corresponding kernel, and then assign a
fitness which is then sent back to the genetic algorithm. To automatically run the
client application, an entry is added to /etc/rc.sysinit to begin the application
in the background once the network interfaces are up. This is required, as each
machine must be able to receive kernels and send fitness values.
Each machine exists in one of two states: running or waiting. Because the current
state must persist between system reboots, the waiting state is signified by the pres-
ence of a file located at /.ga/wait, and the running state is signified by the absence
of that file.
In the waiting state, the machine listens on a designated TCP port for an in-
coming kernel. Once the incoming kernel is received, it is unpacked and then installed
using the kernel installation script at /sbin/installkernel, which is available
on all Red Hat based systems. This script takes three parameters: the kernel version,
the location of the boot image, and the location of the mapfile. The kernel version
corresponds to the name of the kernel modules folder installed in /lib/modules.
The boot image is the bzImage created by building the kernel. The mapfile is the
System.map file built with the kernel. The script first creates a backup of the pre-
vious kernel image with the specified kernel name, then moves the kernel image
and System.map into the /boot directory, and finally updates either GRUB or
LILO to reflect the newly installed kernel. After the new kernel is installed, the state
is changed to running and the system is rebooted.
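A sketch of this installation step is shown below; the kernel version string is a placeholder, and the bzImage and System.map paths refer to the files unpacked from the received archive:

#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Install a freshly received kernel, flip the machine into the running state
 * by removing the wait file, and reboot.  The version string names the
 * matching modules directory under /lib/modules. */
static void install_received_kernel(void)
{
        pid_t pid = fork();

        if (pid == 0) {
                execl("/sbin/installkernel", "installkernel",
                      "2.6.30.5",               /* kernel version (placeholder) */
                      "bzImage",                /* boot image from the archive */
                      "System.map",             /* map file from the archive */
                      NULL);
                _exit(127);
        }
        waitpid(pid, NULL, 0);

        unlink("/.ga/wait");                    /* absence of the file == running state */
        execl("/sbin/reboot", "reboot", NULL);
}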
In the running state, clock_gettime is used to time a set of system calls,
using the real-time system clock. While glibc provides wrapper functions around
each system call, invoking these wrappers can introduce unnecessary overhead, such as
checking the return value and setting errno appropriately. To bypass this, inline
assembler is used. However, the __asm__ keyword alone is not enough in GCC,
as the compiler may still attempt to optimize the instructions or move them to
different locations. To ensure the compiler does not make any changes to the inline
assembler, the __volatile__ keyword is used. Listing 9 demonstrates timing the
mmap2 system call. First, a handle to the clock must be obtained; then, using that
handle, the time is taken before and after the system call.

Listing 9 Timing the mmap2 System Call.

clockid_t cl;
struct timespec start, end;
void* buffer;

clock_getcpuclockid(0, &cl);
clock_gettime(cl, &start);
__asm__ __volatile__ (
        "movl $192,   %%eax\n\t"
        "movl $0,     %%ebx\n\t"
        "movl $8192,  %%ecx\n\t"
        "movl $0x3,   %%edx\n\t"
        "movl $0x22,  %%esi\n\t"
        "movl $-1,    %%edi\n\t"
        "movl $0,     %%ebp\n\t"
        "int  $0x80\n\t"
        "movl %%eax, %0"
        : "=m" (buffer)
        :
        : "%eax", "%ebx", "%ecx", "%edx", "%esi",
          "%edi", "%ebp");
clock_gettime(cl, &end);

After all of the system calls under consideration have been timed, the fitness
score is calculated. Algorithm 5 demonstrates how the score is derived. So that
higher fitness scores are better, the fitness starts at the largest possible value and is
then decreased according to the total measured system call duration.

Algorithm 5
Procedure Calculate_Fitness(T )
(∗ Calculates the Fitness Score ∗)
Input: Total Time For System Calls To Execute T
Output: Fitness Score n
1. n ←INT_MAX
2. for i ←0 to Number of Seconds in T
3. do n ←n − 10000000
4. n ←n − Number of NanoSeconds in T
5. return n
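Algorithm 5 translates directly into C. The sketch below assumes the total measured system call time is accumulated in a struct timespec:

#include <limits.h>
#include <time.h>

/* Direct translation of Algorithm 5: start from the largest possible score
 * and subtract a penalty for every second and nanosecond of measured system
 * call time, so that faster kernels receive higher fitness values. */
int calculate_fitness(struct timespec total)
{
        int n = INT_MAX;

        for (time_t i = 0; i < total.tv_sec; i++)
                n -= 10000000;                  /* per-second penalty from Algorithm 5 */
        n -= (int)total.tv_nsec;                /* nanosecond remainder */

        return n;
}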

The implementation of the testing farm can be found in Section A.4.

Chapter 4

Results

“An algorithm must be seen to be believed”


– Donald Knuth

To evaluate the effectiveness of the designed genetic algorithm, it is first run
under a few different configurations. The kernels produced are then analyzed and com-
pared with the default Fedora stock kernel, compiled with the GNU C Compiler, and
with a kernel compiled with the Intel C Compiler without any optimizations. By comparing
the evolved kernels with the stock Fedora kernel, 2.6.27.19-170.2.35.fc10.i686,
we determine whether any forward progress has been made in evolving a faster ker-
nel. By comparing the evolved kernels with the unoptimized kernel built with the
Intel Compiler, we gauge how much of that progress resulted from evolution, and how
much of that progress resulted from a superior compiler.

4.1 Configurations
The genetic algorithm was run in three unique configurations. Table 4.1 demonstrates
these configurations. For all three configurations, a population size of ten was used.
As discussed in Section 2.3, Cooper et al. found that population sizes over twenty did
not provide better results than population sizes under twenty [4]. The population of
twenty was reduced to ten for these experiments due to time constraints. By reducing
the population size to ten, more time was available for testing different configurations.
Two of the three configurations ran for five generations; the third ran for ten.
The primary test was run 0, and the other two were modifications of that run. Run
1 kept the generation count of run 0 but timed one additional system call, fork,
while run 2 increased the generation count to ten and timed the fork, mmap2, and
munmap system calls as well, to gauge the effect of these changes compared to run 0.
The system calls chosen were determined by the application profiler discussed in
Section 3.1. The application profiler analyzed the UNIX commands ls, du, and
mkdir. The results from the profiler can be found in Table 4.2. Out of the 300+
system calls, only the top 12 are listed; all other system calls had counts in the single
digits, with most at 0. From this table, open, close, read, write, fstat64,
mmap2, and munmap were chosen for use.
# Population Generations System Calls
0 10 5 Open, Close, Read, Write, Fstat64
1 10 5 Open, Close, Read, Write, Fstat64, Fork
2 10 10 Open, Close, Read, Write, Fstat64, Fork, Mmap2, Munmap

Table 4.1: Genetic Algorithm Configurations.

Call Name Occurrence Count


open 196
stat64 133
mmap2 52
close 40
fstat64 31
getxattr 28
read 27
write 20
munmap 14
lstat64 14
lgetxattr 14
mprotect 13

Table 4.2: Application Profiler Results.

4.2 Resulting Kernels


Table 4.3 shows the highest fitness value achieved for each configuration. The con-
figuration labeled Default represents the fitness obtained by timing each of the system
calls from run 2 on the stock Fedora kernel and then running the fitness calculation
algorithm described in Section 3.4. The configuration labeled ICC represents the same
process run on a kernel built with the Intel C Compiler without any optimizations.
Except for run 1, each of these fitness values was the best fitness value after the
algorithm had ended. In run 1, the top fitness value was achieved a generation before
finishing; however, that top value was selected for mutation, and the resulting fitness
after the mutation was lower than the previous value.

Configuration Highest Fitness


0 2148607860
1 2147353727
2 2147485533
Default 2147144574
ICC 2146752119

Table 4.3: Highest Fitnesses For Each Configuration.

All three evolved kernels resulted in a higher total fitness than both the Fedora
default kernel and the ICC kernel. Interestingly, the ICC kernel scored a lower fitness
value than the default Fedora kernel. The Fedora kernel was most likely built with
either -O2 or -O3, as opposed to the ICC kernel built without any optimizations.
While this may seem like an unfair comparison, it provides a base comparison for the
flags the genetic algorithm chose.
While the evolved kernels have a higher fitness value, they must be analyzed and
benchmarked to determine whether this increased fitness corresponds to an increase
in performance.

4.3 Produced Kernels


The kernel produced by run 0 was built using the flags -axSSE3, -mia32,
-fomit-frame-pointer, -fnon-call-exceptions, -funroll-loops, and
-alias-const. These compiler optimizations produced the kernel with the highest
fitness. The first option, -axSSE3, generates processor-specific auto-dispatch code
paths, but only if the compiler can detect an opportunity for performance gain. Due
to the processor-specific nature of this optimization, the kernel can only be used
on platforms which support SSE3. While the Intel Atom technically has its own
specialized SSE category, SSE3_ATOM, that category is not a valid option for this
flag, and thus the SSE3 is a good choice, as any lower category will not take advantage
of certain SSE features available on the Atom, and any higher categories risk executing
illegal instructions. The next flag -mia32 is an interesting choice, as typically, this
optimization informs the compiler to generate code for any Intel IA-32 processor,
however when used in conjunction with the -ax flag, the compiler produces additional
specialized code paths optimized for the SSE category specified in -ax. [9].
The kernel produced by run 1 was built using the flags -axSSE3, -fexceptions,
-funroll-loops, -scalar-rep, -alias-const, -fargument-noalias-global,
-opt-multi-version-aggressive, -vec, -opt-malloc-options=3,
-use-intel-optimized-headers, -fp-speculation=strict, -prec-sqrt,
-prec-div, -fast-transcendentals, -fp-port, -rcd, -ftz,
-inline-forceinline, -Zp16, -no-bss-init, and -falign-functions=16.
Again, the -axSSE3 flag is present; however, this kernel drastically differs from the
run 0 kernel, as optimizations such as -vec, which enables vectorization, and
-Zp16 and -falign-functions=16, which set structure and function align-
ment to 16 bytes, are active. Seven of the optimizations deal with floating point
calculations, which were not evaluated by the fitness function, so their impact on
kernel performance will only be visible in the floating point tests during the applica-
tion benchmarks. An interesting combination, and possibly one of the reasons for
the lower fitness score, is -ftz with -axSSE3. The compiler flag -ftz attempts to
improve performance by flushing denormal results to zero in gradual underflow
mode. While this option is intended to improve performance, if used in conjunction
with -ax, it inserts code to conditionally set the FTZ and DAZ hardware flags. This
added overhead on SSE3 instructions may be responsible, among other factors, for the
lower fitness score [9].
The kernel produced by run 2 was built using the flags -mcpu=pentium4,
-march=pentium4, -scalar-rep, -alias-const, -fargument-noalias-global,
-opt-ra-region-strategy=default, -vec, -par-schedule=auto,
-fast-transcendentals, -fp-port, -rcd, -ftz, -inline-level=2, -finline,
-Zp16, -align, -falign-functions=16, and -falign-stack=assume-16-byte.
Instead of performing SSE optimizations like the other two kernels, the first two flags,
-mcpu=pentium4 and -march=pentium4, tune the generated code for the Pen-
tium line of processors. While these optimizations are probably somewhat beneficial,
they will not utilize any architecture-specific features introduced after the Pentium
series of Intel processors, and thus will not take full advantage of the Intel Atom.
Just as in the run 1 configuration, the compiler options favor 16 byte alignment.
The option -par-schedule=auto allows either the compiler or the run-time libraries
to determine the best scheduling algorithm for parallel loop iterations. This can be
beneficial in some situations, or an added overhead in others, depending on which
scheduler is chosen and how it is chosen.
While these observations are merely speculations based on fitness scores and com-
piler optimizations, the real performance analysis occurs at the user space level, where
application benchmarks are run to characterize the performance of the kernels.

4.4 Application Benchmarks


To test how the kernel optimizations affect user space applications, user space appli-
cation benchmarks are used. To perform the application benchmarks, the Phoronix
Test Suite is used. Phoronix is a framework consisting of multiple test suites, each
test suite made up of various tests to categorize system performance [18].
Since this work focuses on netbooks, the Phoronix netbook test suite was used.
The netbook test suite attempts to simulate typical netbook workloads, such as file
compression and decompression, media encoding, and GUI interactions. The netbook
test suite consists of eleven different applications. Table 4.4 shows the tests within
the netbook test suite. Each test is executed multiple times and the average is used
as the total score for each category.

Test Description
LAME Wav to MP3 Encoding Using LAME
OGG Wav to Ogg Encoding
FFMpeg AVI to VCD using FFMpeg
7Zip 7-Zip File Compression
Scimark0 Composite Scientific Computation
Scimark1 FFT Calculation
Scimark2 Monte Carlo Simulation
IOZone0 512MB Write Performance
IOZone1 512MB Read Performance
IOZone2 1GB Write Performance
SqlLite 12,500 Inserts to SQLite DB
GNUPG 2 GB File Encryption
CRay Ray Tracing
Ram Integer Add
GTKPerf0 GtkComboBox
GTKPerf1 GtkPixBuf
GTKPerf2 GtkRadioButton

Table 4.4: Phoronix Netbook Test Suite.

The Phoronix netbook test suite was run on each kernel considered. Each of these
tests provides unique insight into which kernels were best suited for certain tasks.
Also considered was a kernel built with the Intel C compiler with no optimizations.
This allows for a comparison of the compiler options the genetic algorithm chose.

4.4.1 Tests with Improved Performance


Of the 17 tests evaluated in the Phoronix netbook test suite, 11 saw increased per-
formance using the evolved kernel, when compared to the stock Fedora kernel.
LAME Test.
The first test performed was the LAME test, where, using the LAME library, a
WAV file was encoded as an MP3. Table 4.5 displays the results. All of the kernels
performed in a similar manner; however, the stock Fedora kernel was slightly
slower than the ICC and evolved kernels. The kernels from run 1 and run 2 tied
for fastest, the ICC kernel was a close second, and the kernel from run 0 came in
third, ahead of only the stock kernel. One common trend between the run 1 and
run 2 kernels was their 16 byte alignment and vectorization optimizations, which
may have given them the upper hand in this test.

Kernel Average Duration (sec)


ICC 162.13
Run 1 162.12
Stock 163.03
Run 0 162.57
Run 2 162.12

Table 4.5: LAME Test Results.

OGG Test.
The next test performed was the OGG test, where a WAV file was encoded as an
OGG file. Table 4.6 displays the results. The fastest kernel was the run 0 kernel, and
again the slowest kernel was the stock Fedora kernel. The combination of -axSSE3
and -mia32 might have been responsible for this performance increase, however that
is purely speculation.

Kernel Average Duration (sec)


ICC 107.98
Run 1 107.93
Stock 108.47
Run 0 107.77
Run 2 107.66

Table 4.6: OGG Test Results.

Scimark.
The next three tests are performed using SciMark, a Java benchmark for testing
scientific and numerical computing. This first test, SciMark0, computes five compu-
tational kernels, an FFT kernel, a Gauss-Seidel relaxation kernel, a Sparse matrix-
multiply kernel, a Monte Carlo integration kernel and a dense LU factorization kernel
[22]. The test compares the combined composite score of all of these kernels. Table
4.7 shows the results of this test. The fastest kernel was produced by run 0. The
second fastest kernel was produced by run 2. The stock Fedora kernel was the slowest
of the kernels tested.
Kernel Average Duration (sec)
ICC 120.42
Run 1 120.4
Stock 120.74
Run 0 120.03
Run 2 120.1

Table 4.7: Scimark0 Test Results.

The next Scimark test, SciMark1, only compares the performance of the FFT
kernel [22]. As seen in Table 4.8, the ICC kernel with no optimizations performed the
best here. The stock Fedora kernel performed substantially worse than the others,
taking a full 2 seconds more.

Kernel Average Duration (sec)


ICC 20.04
Run 1 20.34
Stock 22.63
Run 0 20.08
Run 2 20.34

Table 4.8: Scimark1 Test Results.

The last SciMark test, SciMark2, compares the performance of only the Monte
Carlo integration kernel [22]. The results are shown in Table 4.9. For this test,
the kernels performed very similarly, with almost no difference between the kernels,
perhaps because this task did not interact with the kernel often. The stock Fedora
kernel took the longest to complete the computations.

Kernel Average Duration (sec)
ICC 41.65
Run 1 41.81
Stock 41.87
Run 0 41.84
Run 2 41.84

Table 4.9: Scimark2 Test Results.

Sqllite.
Sqllite is a lightweight file-based database system. The Sqllite benchmark eval-
uates the performance of 12,500 inserts. Table 4.10 displays the results. In this
test, the stock Fedora kernel performed significantly worse than the other Intel-built
kernels. The kernel produced by run 2 outperformed all of the other Intel-built ker-
nels by 1 to 2 seconds. This result may be caused by the Intel-built kernels using
a different IO scheduler than the Fedora kernel. The Fedora kernel, which was con-
figured by the Fedora kernel developers, uses the Complete Fair Queueing Scheduler
(CFQ), while the kernels built with the Intel compiler were configured to use the An-
ticipatory IO Scheduler (AS). The Anticipatory IO Scheduler is optimized to avoid
disk head movements, which can be ideal in mobile devices, in an attempt to reduce
power consumption.

Kernel Average Duration (sec)


ICC 63.69
Run 1 64.33
Stock 202.18
Run 0 63.93
Run 2 62.39

Table 4.10: Sqllite Test Results.

GNUPG.
The GNU Privacy Guard (GNUPG) test evaluates encrypting a 2 GB file using
GNUPG. The results are visible in Table 4.11. The slowest kernel was the stock
Fedora kernel. The fastest kernel was the kernel produced by run 0. This is probably
a result of SSE3 optimizations, but that is merely speculation.

Kernel Average Duration (sec)


ICC 162.45
Run 1 163.19
Stock 169.01
Run 0 162.17
Run 2 163.54

Table 4.11: GNUPG Test Results.

Cray.
The C-ray test measures the time required to ray-trace a scene. The results from
this test can be found in Table 4.12. The fastest kernel produced was the kernel from
run 1. The slowest kernel was the stock Fedora kernel. In general, ray-tracing tends
to be a highly parallel task, which can benefit from the architectural parallelism that
the Intel C compiler can exploit.

Kernel Average Duration (sec)
ICC 2564.34
Run 1 2563.4
Stock 2570.03
Run 0 2564.19
Run 2 2563.91

Table 4.12: Cray Test Results.

GTKPerf0.
The next three tests perform benchmarks using the GTK graphics library, which
is the basis of the popular GNOME desktop. The first test performs a series of
operations on a GtkComboBox. Table 4.13 displays the results. The slowest kernel of
the group was the stock Fedora kernel, and the fastest was the kernel produced
by run 1.

Kernel Average Duration (sec)


ICC 148.14
Run 1 146.95
Stock 163.89
Run 0 151.37
Run 2 149.88

Table 4.13: GTKPerf0 Test Results.

GTKPerf1.
The next GTK test was a benchmark of the GTK PixBufs. Table 4.14 displays
the results. The fastest kernel in this test was the stock Fedora kernel. The slowest
was run 1.

Kernel Average Duration (sec)


ICC 57.52
Run 1 57.71
Stock 23.65
Run 0 56.63
Run 2 57.19

Table 4.14: GTKPerf1 Test Results.

GTKPerf2.
The final GTK test was a benchmark of the GTK radio button. Table 4.15 displays
the results. The fastest kernel in this test was the kernel produced in run 0. The
slowest kernel produced was the stock Fedora kernel.

Kernel Average Duration (sec)
ICC 26.93
Run 1 26.65
Stock 27.34
Run 0 26.52
Run 2 26.57

Table 4.15: GTKPerf2 Test Results.

4.4.2 Tests with Decreased Performance


Five of the 17 tests resulted in decreased performance when compared to the stock
Fedora kernel. Four of those five tests saw improved performance from the evolved
flags when compared to the ICC kernel.
7-Zip Test.
The 7-Zip test benchmarked compressing a file using 7-Zip compression. In this
test, the stock Fedora kernel outperformed all of the other kernels. The second best
kernel was the ICC kernel with no optimizations, and it appears that all of the
optimizations chosen negatively impacted this benchmark's performance.

Kernel Average Duration (sec)


ICC 858
Run 1 864.33
Stock 842.66
Run 0 860
Run 2 862

Table 4.16: 7zip Test Results.

IOZone0.
The next three tests are performed using the IOZone filesystem benchmark. As
seen in Table 4.17, Table 4.18, and Table 4.19, the Fedora kernel was the fastest
kernel for all IOZone tests by a significant margin.
The first test benchmarked 512 MB write performance. As an interesting
observation, the worst kernel was the ICC kernel with no optimizations. Comparing
this kernel with the results of the three runs, it is apparent that in each case the
optimizations chosen by the genetic algorithm did improve performance.

Kernel Average Duration (sec)
ICC 54.8
Run 1 53.3
Stock 36.46
Run 0 51.95
Run 2 52.48

Table 4.17: IOZone0 Test Results.

IOZone1.
In the next IOZone test, 512 MB read performance is benchmarked. Just like
the results for the first IOZone test, the worst kernel was the ICC kernel. When
compared to the ICC kernel, each of the kernels evolved by the genetic algorithm
improved performance.

Kernel Average Duration (sec)
ICC 538.76
Run 1 520.11
Stock 360.67
Run 0 531.51
Run 2 523.81

Table 4.18: IOZone1 Test Results.

IOZone2.
For the last IOZone test, 1 GB write performance was benchmarked. Yet again,
the trend continues from the previous two IOZone tests: the ICC kernel was the
worst, and the three evolved kernels improved performance in comparison.

Kernel Average Duration (sec)


ICC 48.88
Run 1 44.69
Stock 33.71
Run 0 45.01
Run 2 43.96

Table 4.19: IOZone2 Test Results.

While there is no empirical evidence to prove this, it is the author’s opinion that
these performance results occurred due to the difference in IO scheduler configurations
discussed earlier in regards to the Sqllite test.
Ram.
The RAMspeed test performs integer addition in memory, to benchmark the per-
formance of memory access. Table 4.20 shows the results. The slowest kernel pro-
duced was the kernel from run 1, and the Fedora kernel was the fastest. The ICC
kernel was the next fastest, indicating that all of the options chosen by the genetic
algorithm reduced performance in this case.

Kernel Average Duration (sec)
ICC 2006.93
Run 1 2021.47
Stock 1888.01
Run 0 2006.98
Run 2 2018.61

Table 4.20: Ram Test Results.

4.4.3 Tests with Static Performance


One of the 17 tests saw no real performance difference, regardless of which kernel
was evaluated.
FFMpeg Test.

In the FFMpeg test an AVI file was converted to an NTSC VCD file. Table 4.21
displays the results. While all of the kernels were very close, run 0 was slightly faster
than the rest, and the slowest was the stock Fedora kernel. This lack of variety within
the test results hints that this benchmark did not interact with the kernel often.
Kernel Average Duration (sec)
ICC 94.43
Run 1 94.47
Stock 94.52
Run 0 94.20
Run 2 94.17

Table 4.21: FFMpeg Test Results.

4.5 Results Summary


Figure 4.1 depicts the performance of each kernel for each test. For 65% of the test
cases, the evolved compiler flags outperformed the stock Fedora kernel. As seen in the
Phoronix test results, these kernel optimizations visibly improve the performance of
user space applications. For example, the GTK tests show that kernel optimizations
can make the GNOME desktop environment more responsive, and the GNUPG test
shows that they can decrease the time needed to encrypt a file. The highest performing
kernel resulted from configuration 0, which was the fastest kernel in the most tests.
One reason the compiler flags chosen by run 0 were so effective is the presence of the
-axSSE3 flag in conjunction with the -mia32 flag; this is a prime example of the
genetic algorithm finding compiler options that work well together. The explanation
for why runs 1 and 2 performed worse stems from the additional system calls added
to those configurations. Both the mmap and fork system calls rely on copy-on-write,
as discussed in Section 3.2.4. Therefore, for these system calls, the fact that the call
has returned does not guarantee that its full cost has been incurred. Since no data
was written to the mmapped pages, and the forked child never wrote to its address
space after the call, neither call was necessarily measured properly. These system
calls thus added noise to the fitness operator and reduced the effectiveness of the
evolution process.
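This effect is easy to reproduce outside the kernel benchmark: timing an anonymous mmap by itself measures little more than bookkeeping, because the page faults that do the real work fire only on the first write. The following illustration is not part of the original test harness:

#define _DEFAULT_SOURCE

#include <string.h>
#include <sys/mman.h>
#include <time.h>

/* Returns the elapsed time of mmap alone when touch == 0, or of mmap plus
 * the first write when touch != 0.  The first write triggers the deferred
 * page faults, which is where much of the real cost of the call is paid. */
static long time_mmap_ns(int touch)
{
        struct timespec start, end;
        const size_t len = 8192;

        clock_gettime(CLOCK_MONOTONIC, &start);
        char* buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
                return -1;
        if (touch)
                memset(buf, 0xAA, len);         /* forces the page faults */
        clock_gettime(CLOCK_MONOTONIC, &end);

        munmap(buf, len);
        return (end.tv_sec - start.tv_sec) * 1000000000L +
               (end.tv_nsec - start.tv_nsec);
}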

Figure 4.1: Phoronix Benchmark Results (Smaller Is Better).

4.6 Threats To Validity


A few items need to be considered when interpreting these results. The first consideration is
the version of the Linux kernels compared. The Fedora kernel was a patched
version of Linux 2.6.27.19-170.2.35 developed by the Fedora project, whereas
the Intel-built kernels were a patched version of vanilla Linux 2.6.30.5. The
Intel-built kernels therefore used a more recent version of Linux, which may affect the perfor-
mance results. In the author's opinion, this difference is minuscule; however, it needs
to be noted. Another difference between the two kernels was their configurations. The
kernel configured by the Fedora project kernel team was configured to run on a wide
range of processors and platforms, and therefore might not contain optimizations as
aggressive as those in the Intel-built kernels. Finally, due to the requirement of an X11
server for the GTK portion of the test, the Phoronix test suite was run in graphical
mode, and thus the results might differ if the test suite were not running as
a child of X11.

Chapter 5

Conclusion and Future Work

“In the future, computers may weigh no more than 1.5 tonnes.”
– Popular Mechanics, 1949

5.1 Conclusion
This work accomplished exactly what it intended to do. The genetic algorithm was
able to effectively find a set of compiler optimizations that interacted well and yielded
better performance than the stock Fedora kernel on the Intel Diamondville platform.
Now that this system has proven to be a successful proof-of-concept, it can be expanded
to more platforms and more compilers, while also growing its feature set to
eventually meet the needs of the home user, who wishes to automatically optimize
his kernel, or of the chipset manufacturer, who wishes to tune vendor compilers and
provide optimization strategies to developers.

5.2 Contributions
This system is not going to change the way millions of people compile their kernels.
What it does contribute is a proof-of-concept. It shows that genetic algorithms can
not only improve user space executable performance, but also can be expanded to
improve kernel performance. It also reinforces the claim that genetic algorithms excel
at finding correlations between options in optimization problems, and it motivates
future work.

5.3 Future Work


Due to the success of this proof-of-concept, the future holds many exciting prospects
for continued research. This section attempts to highlight some desirable key areas
for future research.
5.3.1 Continuations
Due to the time-consuming nature of running the genetic algorithm, a desirable fea-
ture to implement is continuations. Continuations allow for dynamically pausing and
restarting the genetic algorithm. To achieve this, persistent storage for both the pop-
ulation and compiler options needs to be implemented. The ideal way to achieve this
is discussed in Section 3.2.1.

5.3.2 Dynamic Option Parsing


Dynamic Option Parsing using lexical analysis tools such as lex and yacc would
expand support for future compilers and versions, without forcing manual changes to
the compiler options file. Freedom from dependencies on any compiler will also yield
more portability and allow for the expansion to other platforms.

5.3.3 Random Number Evaluation


Future work might consider the impact of various random number generators
on the overall output of the system. This could include adding inherent bias towards
features the user wishes to test further, or believes will result in better kernels.

5.3.4 Persistent Fitness Results


Another desirable implementation feature to alleviate the time-consuming nature of
running the genetic algorithm is to keep a persistent database of all previously cal-
culated fitness scores. This memoization could greatly improve the run-time per-
formance of the genetic algorithm as it is run more and more
often.

5.3.5 Profiles, Generations, Population Size


The most important future work involves expanding the experimentation to optimize
towards more application profiles, and to test various population sizes and generation
lengths, in an attempt to evolve even better compiler flags.

Appendix A

Code Listings

A.1 Profiler Code


/**
 * \file profile.c
 * \author Jim Kukunas
 * \brief
 * Traces the benchmarks, given as command-line arguments, and records the
 * total system call count.  It then calculates the frequency and
 * weights, to determine the fitness evaluation that should be used.
 */

#define _BSD_SOURCE

#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <sys/user.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/**
 * \struct call
 * Holds the name and total of system calls
 */
struct call
{
        char* name;     /**< \brief the name of the system call */
        int counts;     /**< \brief the total number of calls throughout the benchmarks */
        long int num;   /**< \brief the system call number (eax) */
};

int main(int argc, char** argv)
{
        pid_t child, child1;
        int status, count, total;
        FILE* syscalls, *results, *script;
        struct call* calls;
        char* line = NULL;

        if (argc < 3) {
                fprintf(stderr, "Usage:\n%s <syscall file> <benchmark 0>"
                                " ... <benchmark N>\n", *argv);
                return EXIT_FAILURE;
        }

        syscalls = fopen(*(argv+1), "r");
        if (syscalls == NULL) {
                perror("Opening System Call File");
                return EXIT_FAILURE;
        }

        count = 0;
        line = malloc(BUFSIZ);
        if (line == NULL) {
                perror("Allocating Memory");
                fclose(syscalls);
                return EXIT_FAILURE;
        }

        total = 512;
        calls = malloc(sizeof(struct call) * total);
        if (calls == NULL) {
                perror("Allocating Memory for System Calls");
                free(line);
                fclose(syscalls);
                return EXIT_FAILURE;
        }

        /* Read the table of system call names and numbers. */
        while (!feof(syscalls)) {
                fscanf(syscalls, "%s %li\n", line, &((count+calls)->num));
                (count+calls)->name = malloc(strlen(line) + 1);
                strcpy((count+calls)->name, line);
                (count+calls)->counts = 0;
                if (count == total-1) {
                        total += 256;
                        void* new = realloc(calls, sizeof(struct call) * total);
                        if (new == NULL) {
                                perror("Reallocating more memory for the system calls");
                                for (count = 0; count < total-256; count++) {
                                        if ((calls+count)->name != NULL) {
                                                free((calls+count)->name);
                                        }
                                }
                                fclose(syscalls);
                                free(line);
                                free(calls);
                                return EXIT_FAILURE;
                        }
                        else {
                                calls = new;
                        }
                }
                count++;
        }

        free(line);
        fclose(syscalls);
        madvise(calls, sizeof(struct call) * total, MADV_DONTFORK);

        /* Trace each benchmark and count its system calls. */
        for (count = 2; count < argc; count++) {
                child = fork();
                if (child == -1) {
                        perror("Forking a new process");
                }
                if (child == 0) {
                        char* name = strtok(*(argv+count), " "),
                            ** args = malloc(5 * sizeof(char*)),
                             * token;
                        int k = 1;
                        for (int i = 0; i < 5; i++) {
                                *(args+i) = malloc(80);
                        }

                        args[0] = name;

                        token = strtok(NULL, " ");
                        while (token != NULL) {
                                strcpy(*(args+k), token);
                                k++;
                                token = strtok(NULL, " ");
                        }
                        args[k] = NULL;

                        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
                        execvp(name, args);
                }
                else {
                        struct user_regs_struct regs;
                        while (1) {
                                wait(&status);
                                ptrace(PTRACE_SYSCALL, child, 0, 0);
                                wait(&status);
                                ptrace(PTRACE_GETREGS, child, 0, &regs);
                                for (int i = 0; i < total; i++) {
                                        if ((calls+i)->num == regs.orig_eax) {
                                                (calls+i)->counts++;
                                                break;
                                        }
                                }
                                ptrace(PTRACE_SYSCALL, child, 0, 0);
                                if (WIFEXITED(status)) {
                                        break;
                                }
                        }
                }
        }

        results = fopen("results", "w+");
        if (results == NULL) {
                perror("Opening Results File");
                results = stdout;
        }

        fprintf(results, "System_Call,Counts\n");
        for (count = 0; count < total; count++) {
                if ((calls+count)->name == NULL) {
                        break;
                }
                else {
                        fprintf(results, "%s,%i\n",
                                (calls+count)->name,
                                (calls+count)->counts);
                }
        }

        /* Emit an R script that computes frequencies and plots the top 15 calls. */
        script = fopen("proc.RSCRIPT", "w+");
        fprintf(script, "data <- read.csv(\"results\", header=TRUE)\n"
                        "total <- sum(data$Counts)\n"
                        "data$Counts <- data$Counts/total\n"
                        "sd <- data[order(data$Counts, decreasing=TRUE), ] [1:15,]\n"
                        "require(plotrix)\n"
                        "radial.plot(sd$Counts, labels=sd$System_Call, rp.type=\"r\","
                        "line.col=27, lwd=2, main=\"Top 15 System Call Frequency\")\n"
                        "write.table(sd$System_Call, file=\"rout.dat\")\n"
               );

        fclose(results);
        fclose(script);
        sync();

        child1 = fork();
        if (child1 == -1) {
                perror("Creating Child for RScript");
        }
        if (child1 == 0) {
                execl("/usr/bin/Rscript", "Rscript", "proc.RSCRIPT", NULL);
        }

        for (count = 0; count < total; count++) {
                if ((calls+count)->name != NULL) {
                        free((calls+count)->name);
                }
        }
        free(calls);
}

cc = gcc
out = profiler

build_debug: profile.c;
	${cc} profile.c -g -std=c99 -Wall -pedantic -o ${out}
build: profile.c;
	${cc} profile.c -std=c99 -O2 -o ${out}

A.2 GA Code
A.2.1 main.h
/**
 * \file main.h
 * \author Jim Kukunas <jkukunas@acm.org>
 * \date 02/2010
 * \brief Function declarations to create and run the system
 *
 * \mainpage Senior Thesis
 *
 * \section Introduction
 * This project ...
 *
 */

#ifndef _MAIN_H
#define _MAIN_H

#define _POSIX_C_SOURCE 199309L
#define _GNU_SOURCE

#include <string.h>
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <math.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>

#include <sys/socket.h>
#include <sys/sendfile.h>
#include <arpa/inet.h>
#include <netdb.h>

#include "hash.h"

#define SELECT_TRUNC 0x00
#define SELECT_TOURN 0x01

struct command_opt
{
        char** flag;
        int counts;
        int dif_index;
};

struct inhab
{
        char* member,
            * line;
        int fitness,
            mem_size;
        int choices[29];
};

void cleanup_pop(struct inhab** pop,
                 const uint64_t size);

void cleanup_opt(struct command_opt** opt,
                 const uint64_t size);

char init_population(struct inhab** dest,
                     const uint64_t population_size,
                     const uint64_t option_count);

char parse_options(struct command_opt** dest,
                   int* size,
                   const char* file);

void run(int num_gen,
         const char* output_file,
         struct command_opt** options,
         int option_size,
         struct inhab** population,
         int population_size);

void reproduction(FILE* log, struct inhab* a, struct inhab* b,
                  struct inhab* dest);

void build_flags(FILE* log, struct inhab** pop, int pop_size,
                 struct command_opt** opts, int opt_size);
void fitness(FILE* log, struct inhab** pop, int pop_size,
             struct command_opt** opts, int opt_size);
void mutation(FILE* log, struct inhab** dest, int population_size, int option_size);

void selection(FILE* log, struct inhab** pop, int pop_size);

int fitness_compare(const void* a, const void* b);

void ep_sl(void* buffer, int size);
void ep_sr(void* buffer, int size);
void ep_or(char* buffer, unsigned char* or, int size);
void print_flags(FILE* out, struct inhab** pop, int pop_size);

//hash_t table;

struct sockaddr_in fit_addr;
int fit_sock;
#endif
A.2.2 main.c
1 #include"main.h"
2

3 char init_population(struct inhab** dest,


4 const uint64_t population_size,

52
5 const uint64_t option_count)
6 {
7 char* line;
8 int rand_fd;
9

10 line = malloc((int)((option_count/8)+1));
11 if(line == NULL) {
12 perror("Allocating Memory");
13 return 0x01;
14 }
15 rand_fd = open("/dev/urandom", O_RDONLY);
16 if(rand_fd == -1) {
17 perror("Could Not Open /dev/random");
18 free(line);
19 return 0x01;
20 }
21

22 *dest = malloc(sizeof(struct inhab) *


population_size);
23 for(int i = 0; i < population_size; i++) {
24 ((*dest)+i)->fitness = -1;
25 ((*dest)+i)->member = malloc((int)((
option_count/8)+1));
26 if( ((*dest)+i)->member == NULL) {
27 perror("Allocating Memory");
28 }
29 ((*dest)+i)->mem_size = (int)((option_count
/8)+1);
30 ((*dest)+i)->line = malloc(8192);
31 if( ((*dest)+i)->line == NULL) {
32 perror("Allocating Line Memory");
33 }
34 memset(&(((*dest)+i)->choices), 0, 116);
35

36 read(rand_fd, line, (int)((option_count/8)


+1));
37 strcpy(((*dest)+i)->member, line);
38 }
39

40

41 close(rand_fd);
42 free(line);
43

53
44 return 0;
45 }
46

47 char parse_options(struct command_opt** dest,


48 int* size,
49 const char* file)
50 {
51

52 FILE* cmd = fopen(file, "r");


53 if(cmd == NULL) {
54 perror("Opening Option File");
55 return 0x01;
56 }
57 char* line = malloc(80);
58 if(line == NULL) {
59 perror("Allocating Memory");
60 fclose(cmd);
61 return 0x01;
62 }
63

64 *dest = malloc(sizeof(struct command_opt)*512);


65 register int index = 0, tmp;
66 unsigned int line_size = 80;
67

68 int diff_index = 0;
69

70 while(!feof(cmd)) {
71 fscanf(cmd, "%s", line);
72 if(!strncmp(line, "EITHER", 6)) {
73 fscanf(cmd, "%i", &(((*dest)+index)
->counts));
74 (((*dest)+index)->flag) = malloc(((*
dest)+index)->counts * sizeof(
char*));
75 ((*dest)+index)->dif_index =
diff_index;
76 diff_index++;
77 for(tmp = 0; tmp < ((*dest)+index)->
counts; tmp++) {
78 ((*dest)+index)->flag[tmp] =
malloc(80);
79 getline(&((*dest)+index)->
flag[tmp], &line_size,

54
cmd);
80

81 }
82 }
83 else {
84 (((*dest)+index)->flag) = malloc(
sizeof(char*));
85 ((*dest)+index)->flag[0] = malloc
(80);
86 strcpy(((*dest)+index)->flag[0],
line);
87 ((*dest)+index)->counts = 1;
88 }
89 index++;
90 }
91

92 *size = index;
93 free(line);
94 fclose(cmd);
95

96 // init_hash(&table, pow(2, index));


97

98 return 0x00;
99 }
100

101

102 void ep_sl(void* buffer, int size)


103 {
104 for(unsigned char* tmp = buffer; size--; ++tmp) {
105 unsigned char bit = 0x00;
106 if(size) {
107 bit = tmp[1] & (1 << 7) ? 1 : 0;
108 }
109

110 *tmp <<= 1;


111 *tmp |= bit;
112 }
113 }
114

115 void ep_sr(void* buffer, int size)


116 {
117 for(unsigned char* tmp = buffer; size--; ++tmp) {
118 unsigned char bit = 0x00;

55
119 if(size) {
120 bit = tmp[1] & 1 ? 1 : 0;
121 }
122 *tmp >>= 1;
123 *tmp |= bit;
124 }
125 }
126

127 void ep_or(char* buffer, unsigned char* or, int size)


128 {
129 for(int i = 0; i < size; i++) {
130 *(buffer+i) |= *(or+i);
131 }
132 }
133

134 void run(int num_gen,


135 const char* output_file,
136 struct command_opt** options,
137 int option_size,
138 struct inhab** population,
139 int population_size)
140 {
141 FILE* log = fopen(output_file, "w+");
142 for(int i =0; i < num_gen; i++) {
143 fprintf(log, "Generation %i\n", i);
144 fitness(log,population, population_size,
145 options, option_size);
146 selection(log, population, population_size);
147 mutation(log, population, population_size,
option_size);
148 fprintf(log, "\n\n");
149 print_flags(log, population, population_size
);
150 }
151 fclose(log);
152 }
153

154

155 void build_flags(FILE* log, struct inhab** pop, int pop_size


,
156 struct command_opt** opts, int opt_size)
157 {
158 char* tmp;

56
159 for(int i = 0; i < pop_size; i++) {
160 int byte = 0;
161 if(((*pop)+i)->fitness == -1) {
162 for(int j=0; ((byte*8)+j) < opt_size
; j++) {
163 if(*((((*pop)+i)->member)+
byte) & (1 << j)) {
164 if( ((*opts)+((byte
*8)+j))->counts >
1) {
165 if(((*pop)+i
)->
choices[
166 ((*
opts
)
+((
byte
*8)
+
j
)
)
->
dif_inde
]

==

0)

167 {
168 ((*
pop
)
+
i
)
->
choices
[
169 ((

57
170 ra

171 (

172 strcat
(

((*
pop
)
+

58
i
)
->
line
,
173 ((*
o
)
+
b
*
+
j
)
-
f
[

174 ((*
p
)
+
i
)
-
c
[

175 ((*
o
)
+
b
*
+
j
)
-
d
]
;

59
176 strcat
(

((*
pop
)
+
i
)
->
line
,

"

")
;
177 }
178 }
179 else {
180 strcat( ((*
pop)+i)->
line,
181 ((*
opts
)
+((
byte
*8)
+
j
)
)
->
flag
[0])
;
182 strcat( ((*
pop)+i)->
line, "
");
183 }
184 }

60
185

186 if(j % 7 == 0) {
187 byte++;
188 j = 0;
189 }
190 }
191

192 tmp = index(((*pop)+i)->line, ’\n’);


193 while(tmp != NULL) {
194 *tmp = ’ ’;
195 tmp = index(((*pop)+i)->line
, ’\n’);
196 }
197 }
198 }
199 }
200

201 void print_flags(FILE* out, struct inhab** pop, int pop_size


)
202 {
203 for(int i =0; i < pop_size; i++) {
204 fprintf(out, "%i: %i = %s\n", i, ((*pop)+i)
->fitness,
205 ((*pop)+i)->line);
206 }
207 }
208

209 void send_get(struct inhab** pop, int i)


210 {
211 /* Send Kernel 1 to Netbook
212 * Wait for results
213 * Send kernel 2 to netbook
214 * wait for results
215 * Update fitness
216 */
217 int send_sock = socket(PF_INET, SOCK_STREAM,
218 0);
219

220 socklen_t _length;


221 struct sockaddr_in netbook, recv_addr;
222 memset(&netbook, 0, sizeof(struct sockaddr_in));
223 netbook.sin_family = AF_INET;
224 netbook.sin_port = htons(12345);

61
225 inet_pton(AF_INET, "141.195.226.19",
226 &netbook.sin_addr);
227

228 con:
229 if(connect(send_sock, (struct sockaddr*)&netbook,
230 sizeof(netbook)) == -1) {
231 perror("Connecting to Netbook... Trying
again");
232 sleep(10);
233 goto con;
234 }
235

236

237 fprintf(stderr, "Successfully Connected to Netbook")


;
238

239 int file = open("/home/cs8/kukunaj/"


240 "kernel.tar", O_RDONLY);
241 if(file == -1) {
242 perror("Finding Kernel Tar");
243 close(send_sock);
244 }
245

246 struct stat file_stat;


247 fstat(file, &file_stat);
248

249 int meh = 0;


250 int s;
251 char* meh_ = malloc(1024);
252 while( meh < file_stat.st_size) {
253 s = read(file, meh_, 1023);
254 meh += send(send_sock, meh_, s, 0);
255 }
256 shutdown(send_sock, SHUT_RDWR);
257 close(file);
258

259 int new_sock = accept(fit_sock, (struct sockaddr*) &


recv_addr, &_length);
260

261 if(new_sock == -1) {


262 perror("Accepting Local Socket");
263 }
264 if(recv(new_sock, meh_, 1024, 0) < 1) {

62
265 perror("Reading Fitness Score");
266 }
267

268 ((*pop)+i)->fitness = atoi(meh_);


269 fprintf(stdout, "Kernel Recieved a fitness of %s\n",
meh_);
270 close(new_sock);
271

272 /* Kernel 2 */
273

274 send_sock = socket(PF_INET, SOCK_STREAM,


275 0);
276

277 memset(&netbook, 0, sizeof(struct sockaddr_in));


278 netbook.sin_family = AF_INET;
279 netbook.sin_port = htons(12345);
280 inet_pton(AF_INET, "141.195.226.19",
281 &netbook.sin_addr);
282

283 _con:
284 if(connect(send_sock, (struct sockaddr*)&netbook,
285 sizeof(netbook)) == -1) {
286 perror("Connecting to Netbook... Trying
again");
287 sleep(10);
288 goto _con;
289 }
290

291

292 fprintf(stderr, "Successfully Connected to Netbook")


;
293

294 file = open("/home/cs8/kukunaj/"


295 "_kernel.tar", O_RDONLY);
296 if(file == -1) {
297 perror("Finding Kernel Tar");
298 close(send_sock);
299 }
300

301 struct stat file_stat2;


302 fstat(file, &file_stat2);
303

304 memset(meh_, 0, 1024);

63
305 meh = 0;
306 while( meh < file_stat2.st_size) {
307 s = read(file, meh_, 1023);
308 meh += send(send_sock, meh_, s, 0);
309 }
310 shutdown(send_sock, SHUT_RDWR);
311 close(file);
312

313 new_sock = accept(fit_sock, (struct sockaddr*) &


recv_addr, &_length);
314

315 if(new_sock == -1) {


316 perror("Accepting Local Socket");
317 }
318 if(recv(new_sock, meh_, 1024, 0) < 1) {
319 perror("Reading Fitness Score");
320 }
321

322 ((*pop)+i+1)->fitness = atoi(meh_);


323 fprintf(stdout, "Kernel Recieved a fitness of %s\n",
meh_);
324

325 free(meh_);
326 close(new_sock);
327 }
328

329 void fitness(FILE* log, struct inhab** pop, int pop_size,


330 struct command_opt** opts, int opt_size)
331 {
332 build_flags(log, pop, pop_size, opts, opt_size);
333 for(int i =0; i < pop_size; i=i+2) {
334

335 int child = fork();


336

337 if(child == -1) {


338 perror("Creating Fitness Child");
339 }
340

341 if(child == 0) {
342 if( ((*pop)+i)->fitness == -1) {
343

344 int sock = socket(


PF_INET,

64
SOCK_STREAM, 0);
345 struct sockaddr_in
builder;
346

347 memset(&builder, 0,
sizeof(struct
sockaddr_in));
348 builder.sin_family =
AF_INET;
349 builder.sin_port =
htons(12345);
350 // inet_pton(AF_INET,
"141.195.226.146",
351 inet_pton(AF_INET,
"141.195.226.136",

352 &
builder
.
sin_addr
)
;
353

354

355 if(connect(sock, (
struct sockaddr*)
&builder,
356 size
(
bu
)
)

==

-1
{

357 perror("
Connecting
to
Builder")
;

65
358 close(sock);
359 }
360

361 send(sock, ((*pop)+i


)->line, strlen
(((*pop)+i)->line
), 0);
362 shutdown(sock,
SHUT_RDWR);
363

364

365 struct sockaddr_in


meh_addr;
366 socklen_t length__;
367 char* buffer= malloc
(80);
368

369

370 if(fit_sock == -1) {


371 fit_sock =
socket(
PF_INET,
SOCK_STREAM
, 0);
372

373 memset(&
fit_addr,
0,
sizeof(
struct
sockaddr_in
));
374 fit_addr.
sin_family
=
AF_INET;
375 fit_addr.
sin_port
= htons
(12345);
376 fit_addr.
sin_addr.
s_addr =

66
INADDR_ANY
;
377 if(bind(
fit_sock,
(struct
sockaddr
*)&
fit_addr,
378 size
(
fi
)
)

==

-1

379 perror
("
Binding

Failed
")
;
380 break
;
381 }
382

383 if(listen(
fit_sock,
1024) ==
-1) {
384 perror
("
Listening
")
;
385 break
;
386 }

67
387

388 }
389 int sock_n = accept(
fit_sock, (struct
sockaddr*)&
meh_addr,
390 &
length__
)
;
391

392 if(sock_n == -1) {


393 perror("
Accepting

Connection
");
394 break;
395 }
396 if(recv(sock_n,
buffer, 80,0) <
1) {
397 perror("
Reading
Fitness
Score");
398 break;
399 }
400

401 ((*pop)+i)->fitness
= atoi(buffer);
402 fprintf(stdout, "
Kernel Recieved
Fitness of %s\n",
buffer);
403 //hstore(&table, ((*
pop)+i)->member,
((*pop)+i)->
fitness);
404 free(buffer);
405 _exit(0);
406 }
407

68
408 }
409 else {
410 /*
411 int store_ = hload(&table, ((*pop)+i
+1)->member);
412 if(store_ != 0) {
413 ((*pop)+i+1)->fitness =
store_;
414 }
415 else {
416 */
417 char* HOSTCFLAGS = malloc
(8192+13),
418 * HOSTCXXFLAGS = malloc
(8192+15);
419

420 strcpy(HOSTCFLAGS, "


HOSTCFLAGS=");
421 strcat(HOSTCFLAGS, ((*pop)+i
+1)->line);
422 strcpy(HOSTCXXFLAGS, "
HOSTCXXFLAGS=");
423 strcat(HOSTCXXFLAGS, ((*pop)
+i+1)->line);
424 fprintf(stdout, "\n\
nBuilding With Flags of\n
"
425 "%s\n", ((*
pop)+i+1)
->line);
426

427 int status;


428 int pid;
429

430

431 pid = fork();


432 if(pid == -1) {
433 perror("Again,
Fitness Child");
434 }
435 else if(pid == 0) {
436 chdir("/home/cs8/
kukunaj/_linux

69
-2.6.30.5/");
437 execl("/usr/bin/make
", "make", "-j4",
438 "
clean
",

NULL
)
;
439 }
440 waitpid(pid, &status, 0);
441

442 pid = fork();


443 if(pid == -1) {
444 perror("Stuff With
Child");
445 }
446 else if(pid == 0){
447 chdir("/home/cs8/
kukunaj/_linux
-2.6.30.5/");
448 execl("/usr/bin/make
", "make", "-j4",
"oldconfig",
449 NULL
)
;
450 }
451 waitpid(pid, &status, 0);
452 if(WIFEXITED(status)) {
453 if(WEXITSTATUS(
status) != 0) {
454 fprintf(
stderr, "
Error
Building"
455 "
oldconfig
\
n");
456                     ((*pop)+i+1)->fitness = -500;
457                     break;
458                 }
459             }
460
461
462             pid = fork();
463             if(pid == -1) {
464                 perror("Yet Another Fitness Child");
465             }
466             else if(pid == 0) {
467                 chdir("/home/cs8/kukunaj/_linux-2.6.30.5/");
468                 execl("/usr/bin/make", "make", "-j4",
469                         "AR=xiar",
470                         HOSTCFLAGS, HOSTCXXFLAGS,
471                         "LD=xild", "bzImage",
472                         NULL);
473             }
474
475             waitpid(pid, &status, 0);
476             if(WIFEXITED(status)) {
477                 if(WEXITSTATUS(status) != 0) {
478                     fprintf(stderr, "Error Building"
479                             " bzImage\n");
480                     ((*pop)+i+1)->fitness = -500;
481                     break;
482                 }
483             }
484
485             pid = fork();
486             if(pid == -1) {
487                 perror("Seriously not Cool");
488             }
489             else if(pid == 0) {
490                 execl("/bin/tar", "tar", "-cvf", "/home/cs8/kukunaj/_kernel.tar",
491                         "/home/cs8/kukunaj/_linux-2.6.30.5/arch/x86/boot/bzImage",
492                         "/home/cs8/kukunaj/_linux-2.6.30.5/System.map", NULL);
493             }
494
495             waitpid(pid, &status, 0);
496             waitpid(child, &status,0);
497
498
499             int send_sock = socket(PF_INET, SOCK_STREAM,
500                     0);
501             if(send_sock == -1) {
502                 perror("Creating the send socket");
503             }
504
505             struct sockaddr_in netbook;
506             memset(&netbook, 0, sizeof(struct sockaddr_in));
507             netbook.sin_family = AF_INET;
508             netbook.sin_port = htons(12345);
509             inet_pton(AF_INET, "141.195.226.19",
510                     &netbook.sin_addr);
511
512 con:
513             if(connect(send_sock, (struct sockaddr*)&netbook,
514                     sizeof(netbook)) == -1) {
515                 perror("Connecting to Netbook... Trying again");
516                 sleep(10);
517                 goto con;
518             }
519
520
521             fprintf(stderr, "Successfully Connected to Netbook");
522
523             int file = open("/home/cs8/kukunaj/"
524                     "_kernel.tar", O_RDONLY);
525             if(file == -1) {
526                 perror("Finding Kernel Tar");
527                 close(send_sock);
528             }
529
530             struct stat file_stat;
531             off_t off = 0;
532             fstat(file, &file_stat);
533
534             int meh = 0;
535             int s;
536             char* meh_ = malloc(1024);
537             while( meh < file_stat.st_size) {
538                 s = read(file, meh_, 1023);
539                 meh += send(send_sock, meh_, s, 0);
540             }
541             shutdown(send_sock, SHUT_RDWR);
542             free(meh_);
543             close(file);
544
545             struct sockaddr_in recv_addr;
546             socklen_t _length;
547             char* rcbuffer = malloc(80);
548
549             if(fit_sock == -1) {
550                 fit_sock = socket(PF_INET, SOCK_STREAM, 0);
551
552                 memset(&fit_addr, 0, sizeof(struct sockaddr_in));
553                 fit_addr.sin_family = AF_INET;
554                 fit_addr.sin_port = htons(12345);
555                 fit_addr.sin_addr.s_addr = INADDR_ANY;
556                 if(bind(fit_sock, (struct sockaddr*)&fit_addr,
557                         sizeof(fit_addr)) == -1) {
558                     perror("Binding Failed");
559                 }
560
561                 if(listen(fit_sock, 1024) == -1) {
562                     perror("Listening");
563                 }
564
565             }
566
567             int new_sock = accept(fit_sock, (struct sockaddr*) &recv_addr, &_length);
568
569             if(new_sock == -1) {
570                 perror("Accepting Local Socket");
571                 break;
572             }
573             if(recv(new_sock, rcbuffer, 80, 0) < 1) {
574                 perror("Reading 2nd Fitness Score");
575             }
576
577             ((*pop)+i+1)->fitness = atoi(rcbuffer);
578             fprintf(stdout, "Kernel Recieved a fitness of %s\n", rcbuffer);
579
580             close(new_sock);
581             free(rcbuffer);
582         }
583     }
584 }
585
586 void reproduction(FILE* log, struct inhab* a, struct inhab* b,
587                   struct inhab* dest)
588 {
589     unsigned char* tmp = malloc(a->mem_size),
590                  * tmp0 = malloc(a->mem_size);
591     memset(dest->member, 0, dest->mem_size);
592
593     memcpy(tmp, a->member,a->mem_size );
594     memcpy(tmp0, b->member, a->mem_size);
595
596
597     for(int i =0; i < (int)floor((strlen(a->member)*8)/2); i++) {
598         ep_sl(tmp, a->mem_size);
599         ep_sr(tmp0,a->mem_size);
600     }
601
602     /*
603     *(dest->member) |= ( *(a->member) << (int)floor(
604                             (strlen(a->member)*8)/2));
605     *(dest->member) |= ( *(b->member) >> (int)floor(
606                             (strlen(b->member)*8)/2));
607     */
608
609     ep_or(dest->member, tmp, dest->mem_size);
610
611     dest->fitness = -1;
612     memset(dest->line, 0, strlen(dest->line));
613     free(tmp);
614     free(tmp0);
615 }
616
617
618 int fitness_compare(const void* a, const void* b)
619 {
620     if( ((struct inhab*)a)->fitness > ((struct inhab*)b)->fitness)
621     {
622         return 1;
623     }
624     else if( ((struct inhab*)a)->fitness ==
625              ((struct inhab*)b)->fitness) {
626         return 0;
627     }
628     else {
629         return -1;
630     }
631 }
632
633
634 void selection(FILE* log, struct inhab** pop, int pop_size)
635 {
636     qsort(*pop, pop_size, sizeof(struct inhab),
637           fitness_compare);
638     fprintf(log, "The highest fitness this generation was %i from\n%s\n",
639             ((*pop)+pop_size-1)->fitness, ((*pop)+pop_size-1)->line);
640     for(int i=0; i < (int)floor(.25 * (pop_size-1)); i++) {
641         int part = rand() % (int)floor(.50 * (pop_size-1)) +
642                    (int) floor(.25 * (pop_size-1));
643         reproduction(log, ((*pop)+pop_size-i-1),
644                      ((*pop)+pop_size -part-1),
645                      ((*pop)+i));
646     }
647 }
648
649
650
651 void mutation(FILE* log,struct inhab** dest, int population_size, int option_size)
652 {
653     unsigned char mask;
654     int x, rnd;
655
656     rnd = open("/dev/urandom", O_RDONLY);
657
658     for(int i =0; i < population_size; i++) {
659
660         x = rand() % 99;
661         if(x > 97) {
662             read(rnd, &mask, 1);
663             x = rand() % (int)((option_size/8)+1);
664             *(((*dest)+i)->member+x) |= mask;
665             ((*dest)+i)->fitness = -1;
666         }
667     }
668     close(rnd);
669 }
670
671 void cleanup_pop(struct inhab** pop,
672                  const uint64_t size)
673 {
674     for(int i =0; i < size; i++) {
675         free( ((*pop)+i)->member);
676         free(((*pop)+i)->line);
677     }
678     free(*pop);
679 }
680
681 void cleanup_opt(struct command_opt** opt,
682                  const uint64_t size)
683 {
684     for(int i =0; i < size; i++) {
685         for(int j =0; j < ((*opt)+i)->counts; j++) {
686             free(((*opt)+i)->flag[j]);
687         }
688         free((((*opt)+i)->flag));
689     }
690     free(*opt);
691     // free_hash(&table);
692 }
693
694 int main(int argc, char** argv)
695 {
696     struct inhab* pop;
697     struct command_opt* opts;
698     int size;
699     if(parse_options(&opts, &size, "_ou") != 0x00) {
700         return EXIT_FAILURE;
701     }
702
703     if(init_population(&pop, 10, size) != 0x00) {
704         return EXIT_FAILURE;
705     }
706
707     fit_sock = socket(PF_INET, SOCK_STREAM, 0);
708     if(fit_sock == -1) {
709         perror("Creating Waiting Socket");
710     }
711
712     memset(&fit_addr, 0, sizeof(struct sockaddr_in));
713     fit_addr.sin_family = AF_INET;
714     fit_addr.sin_port = htons(12345);
715     fit_addr.sin_addr.s_addr = INADDR_ANY;
716     if(bind(fit_sock, (struct sockaddr*)&fit_addr,
717             sizeof(fit_addr)) == -1) {
718         perror("Binding Failed");
719     }
720
721     if(listen(fit_sock, 1024) == -1) {
722         perror("Listening");
723     }
724
725     run(10, "out.dat", &opts, size, &pop, 10);
726
727     cleanup_pop(&pop, 10);
728     cleanup_opt(&opts, size);
729     return EXIT_SUCCESS;
730 }

A.2.3 Makefile
1 CC =gcc
2 OUT = out
3
4 debug:
5     ${CC} main.c hash.c -std=c99 -g -Wall -pedantic -o ${OUT} -lm
6 release:
7 ${CC} main.c hash.c -std=c99 -O2 -lm -o ${OUT}
8 parser:
9 lex opt_parser.lex
10 ${CC} lex.yy.c -o lex_parser -ll

A.3 Builder Code


A.3.1 main.c
1 #define _BSD_SOURCE
2
3 #include<stdio.h>
4 #include<stdlib.h>
5 #include<string.h>
6
7 #include<unistd.h>
8 #include<sys/types.h>
9 #include<sys/socket.h>
10 #include<sys/sendfile.h>
11 #include<sys/mman.h>
12 #include<sys/wait.h>
13 #include<sys/stat.h>
14
15 #include<fcntl.h>
16 #include<arpa/inet.h>
17 #include<netdb.h>
18
19 void send_bad_fitness(void);
20 void send_complete(void);
21
22 int main(int argc, char** argv)
23 {
24     struct sockaddr_in addr, new_addr;
25     char* HOSTCFLAGS = malloc(8192+13),
26         *HOSTCXXFLAGS = malloc(8192+15),
27         *buffer;
28
29     int sock = socket(PF_INET, SOCK_STREAM, 0),
30         new_sock, child, status, send_sock;
31     socklen_t length;
32     if(sock == -1) {
33         perror("Creating Socket");
34         return EXIT_FAILURE;
35     }
36
37     madvise(&addr, sizeof(struct sockaddr_in), MADV_DONTFORK);
38     madvise(&new_addr, sizeof(struct sockaddr_in), MADV_DONTFORK);
39     madvise(&sock, sizeof(int), MADV_DONTFORK);
40     madvise(&new_sock, sizeof(int), MADV_DONTFORK);
41     madvise(&length, sizeof(socklen_t), MADV_DONTFORK);
42
43     memset(&addr, 0, sizeof(struct sockaddr_in));
44     addr.sin_family = AF_INET;
45     addr.sin_port = htons(12345);
46     addr.sin_addr.s_addr = INADDR_ANY;
47     if(bind(sock, (struct sockaddr*) &addr, sizeof(addr)) == -1) {
48         perror("Binding");
49         return EXIT_FAILURE;
50     }
51
52     if(listen(sock, 1024) == -1) {
53         perror("Listening");
54         return EXIT_FAILURE;
55     }
56
57     buffer = malloc(8192);
58     while(1) {
59 begin:
60         length = sizeof(new_addr);
61         new_sock = accept(sock, (struct sockaddr*)&new_addr,
62                           &length);
63         if(new_sock == -1) {
64             perror("Accepting");
65             break;
66         }
67         memset(buffer, 0, 8192);
68
69         recv(new_sock, buffer, 8192, 0);
70
71         strcpy(HOSTCFLAGS, "HOSTCFLAGS=");
72         strcat(HOSTCFLAGS, buffer);
73         strcpy(HOSTCXXFLAGS, "HOSTCXXFLAGS=");
74         strcat(HOSTCXXFLAGS, buffer);
75
76         fprintf(stdout, "Preparing to Build Kernel with Flags :\n%s\n", buffer);
77
78         child = fork();
79         if(child == -1) {
80             perror("Another Fork Problem");
81             break;
82         }
83         else if(child == 0) {
84             chdir("/home/cs8/kukunaj/linux-2.6.30.5/");
85             execl("/usr/bin/make", "make", "-j4",
86                   "clean", NULL);
87         }
88         waitpid(child, &status, 0);
89
90         child = fork();
91         if(child == -1){
92             perror("Comeon");
93             break;
94         }
95         else if(child == 0) {
96             chdir("/home/cs8/kukunaj/linux-2.6.30.5/");
97             execl("/usr/bin/make", "make", "-j4",
98                   "oldconfig",NULL);
99         }
100         waitpid(child, &status, 0);
101
102
103         child = fork();
104         if(child == -1) {
105             perror("Forking Child");
106             break;
107         }
108         else if(child == 0) {
109             chdir("/home/cs8/kukunaj/linux-2.6.30.5/");
110             execl("/usr/bin/make", "make", "-j4",
111                   "AR=xiar",
112                   HOSTCFLAGS, HOSTCXXFLAGS,
113                   "LD=xild", "bzImage",
114                   NULL);
115         }
116         waitpid(child, &status, 0);
117         if(WIFEXITED(status)) {
118             if(WEXITSTATUS(status) != 0) {
119                 send_bad_fitness();
120                 close(new_sock);
121                 goto begin;
122             }
123         }
124
125         child = fork();
126         if(child == -1) {
127             perror("Seriously not Cool");
128         }
129         else if(child == 0) {
130             execl("/bin/tar", "tar", "-cvf", "/home/cs8/kukunaj/kernel.tar",
131                   "/home/cs8/kukunaj/linux-2.6.30.5/arch/x86/boot/bzImage",
132                   "/home/cs8/kukunaj/linux-2.6.30.5/System.map", NULL);
133         }
134         waitpid(child, &status, 0);
135
136         send_sock = socket(PF_INET, SOCK_STREAM,
137                            0);
138         if(send_sock == -1) {
139             perror("Creating the send socket");
140             close(new_sock);
141             close(sock);
142             break;
143         }
144
145         struct sockaddr_in netbook;
146         memset(&netbook, 0, sizeof(struct sockaddr_in));
147         netbook.sin_family = AF_INET;
148         netbook.sin_port = htons(12345);
149         inet_pton(AF_INET, "141.195.226.19",
150                   &netbook.sin_addr);
151
152         if(connect(send_sock, (struct sockaddr*)&netbook,
153                    sizeof(netbook)) == -1) {
154             perror("Connecting to Netbook");
155             close(new_sock);
156             close(send_sock);
157             break;
158         }
159
160         int file = open("/home/cs8/kukunaj/"
161                         "kernel.tar.bz2", O_RDONLY);
162         if(file == -1) {
163             perror("Finding Kernel Tar");
164             close(sock);
165             close(new_sock);
166             close(send_sock);
167             break;
168         }
169
170         struct stat file_stat;
171         fstat(file, &file_stat);
172         int meh = 0;
173         int s;
174         char* meh_ = malloc(1024);
175         while( meh < file_stat.st_size) {
176             s = read(file, meh_, 1023);
177             meh += send(send_sock, meh_, s, 0);
178         }
179         shutdown(send_sock, SHUT_RDWR);
180         free(meh_);
181         close(file);
182         close(new_sock);
183     }
184     free(HOSTCFLAGS);
185     free(HOSTCXXFLAGS);
186     free(buffer);
187 }
188
189 void send_complete(void)
190 {
191     int sock = socket(PF_INET, SOCK_STREAM, 0);
192     struct sockaddr_in addr;
193     memset(&addr, 0, sizeof(struct sockaddr_in));
194     addr.sin_family = AF_INET;
195     addr.sin_port = htons(12345);
196     inet_pton(AF_INET, "141.195.226.145",
197               &addr.sin_addr);
198 meh:
199     if(connect(sock, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
200         sleep(10);
201         goto meh;
202     }
203     char* mes = malloc(5);
204     strcpy(mes, "DONE");
205     send(sock, mes, 5, 0);
206     shutdown(sock, SHUT_RDWR);
207     free(mes);
208     close(sock);
209 }
210
211 void send_bad_fitness(void)
212 {
213     char* fit;
214     struct sockaddr_in addr;
215     int sock = socket(PF_INET, SOCK_STREAM, 0);
216
217     memset(&addr, 0, sizeof(struct sockaddr_in));
218     addr.sin_family = AF_INET;
219     addr.sin_port = htons(12345);
220     inet_pton(AF_INET, "141.195.226.145",
221               &addr.sin_addr);
222     if(connect(sock, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
223         perror("Connecting to Send Bad Fitness");
224     }
225
226     fit = malloc(5);
227     strcpy(fit, "-500");
228
229     if(send(sock, fit, 5, 0) <= 0) {
230         perror("Sending Bad Fitness");
231     }
232     shutdown(sock,SHUT_RDWR);
233     free(fit);
234 }
A.3.2 Makefile
1 CC=gcc
2 OUT = builder
3

4 debug:
5 ${CC} -Wall -pedantic -std=c99 main.c -o ${OUT} -g
6 release:
7 ${CC} main.c -std=c99 -O2 -o ${OUT}

A.4 Client Code


A.4.1 main.c
1 #define _XOPEN_SOURCE 600
2 #define _POSIX_C_SOURCE 199309L
3
4 #include<stdio.h>
5 #include<stdlib.h>
6 #include<string.h>
7 #include<errno.h>
8 #include<unistd.h>
9 #include<time.h>
10 #include<limits.h>
11
12 #include<sys/types.h>
13 #include<sys/socket.h>
14 #include<sys/stat.h>
15 #include<fcntl.h>
16 #include<linux/reboot.h>
17 #include<sys/reboot.h>
18 #include<sys/wait.h>
19 #include<netinet/in.h>
20 #include<netinet/ip.h>
21 #include<arpa/inet.h>
22
23
24 void spec_diff(struct timespec* start, struct timespec* end,
25                struct timespec* dest)
26 {
27     if( (end->tv_nsec - start->tv_nsec) <0) {
28         dest->tv_sec = end->tv_sec - start->tv_sec -1;
29         dest->tv_nsec = 1000000000 + end->tv_nsec -
30                         start->tv_nsec;
31     }
32     else {
33         dest->tv_sec = end->tv_sec - start->tv_sec;
34         dest->tv_nsec = end->tv_nsec - start->tv_nsec;
35     }
36     return;
37 }
38
39 void spec_add(struct timespec* start, struct timespec* end,
40               struct timespec* dest)
41 {
42     dest->tv_sec = start->tv_sec + end->tv_sec;
43     dest->tv_nsec = start->tv_nsec + end->tv_nsec;
44     if(dest->tv_nsec > 1000000000) {
45         dest->tv_sec++;
46         dest->tv_nsec -= 1000000000;
47     }
48     return;
49 }
50
51 void send_fitness(int fitness)
52 {
53     char* fit;
54     struct sockaddr_in addr;
55     int sock = socket(PF_INET, SOCK_STREAM, 0);
56
57     memset(&addr, 0, sizeof(struct sockaddr_in));
58     addr.sin_family = AF_INET;
59     addr.sin_port = htons(12345);
60     inet_pton(AF_INET, "141.195.226.145",
61               &addr.sin_addr);
62
63 meh:
64     if(connect(sock, (struct sockaddr*)&addr, sizeof(addr)) != 0) {
65         perror("Connecting to Send Fitness");
66         sleep(10);
67         goto meh;
68     }
69     fprintf(stdout, "Succesfully Connected\n");
70
71     fit = malloc(15);
72     memset(fit, 0, 15);
73     sprintf(fit, "%d", fitness);
74
75     if(send(sock, fit, 15, 0) <= 0) {
76         perror("Sending Fitness");
77     }
78     shutdown(sock,SHUT_RDWR);
79     free(fit);
80 }
81
82
83
84 int runTests(void)
85 {
86     clockid_t cl;
87     int child;
88     struct timespec start, end, total,
89                     op, mm, clo, fs64, rd, fork, wr, mum;
90
91     if(clock_getcpuclockid(0, &cl) != 0) {
92         perror("Getting Clock Id");
93         return EXIT_FAILURE;
94     }
95
96     char* test = malloc(80);
97     int file;
98     ssize_t out;
99     struct stat* fst;
100     strcpy(test, "/home/jim/test.meh");
101
102     /*************************************
103      * Open
104      ************************************/
105     clock_gettime(cl, &start);
106     __asm__ __volatile__ (
107         "movl $5, %%eax\n\t"
108         "movl %1, %%ebx\n\t"
109         "movl $2, %%ecx\n\t"
110         "int $0x80\n\t"
111         "movl %%eax, %0"
112         : "=r" (file)
113         : "r" (test)
114         : "%eax", "%ebx", "%ecx");
115     clock_gettime(cl, &end);
116     spec_diff(&start,&end, &op);
117
118
119     /************************************
120      * fstat64
121      ************************************/
122     clock_gettime(cl, &start);
123     __asm__ __volatile__ (
124         "movl $197, %%eax\n\t"
125         "movl %1, %%ebx\n\t"
126         "movl %2, %%ecx\n\t"
127         "int $0x80 \n\t"
128         "movl %%eax, %0"
129         : "=r" (out)
130         : "r" (file), "r" (fst)
131         : "%eax", "%ebx", "%ecx");
132     clock_gettime(cl, &end);
133     spec_diff(&start, &end, &fs64);
134
135     /**************************************
136      * Write
137      *************************************/
138     clock_gettime(cl, &start);
139     __asm__ __volatile__ (
140         "movl $4, %%eax\n\t"
141         "movl %1, %%ebx\n\t"
142         "movl %2, %%ecx\n\t"
143         "movl $20, %%edx\n\t"
144         "int $0x80 \n\t"
145         "movl %%eax, %0"
146         : "=r" (out)
147         : "r" (file), "r" (test)
148         : "%eax", "%ebx", "%ecx", "%edx");
149     clock_gettime(cl, &end);
150     spec_diff(&start, &end, &wr);
151
152     spec_add(&op, &wr, &total);
153
154
155     /*************************************
156      * Read
157      ************************************/
158     clock_gettime(cl, &start);
159     __asm__ __volatile__ (
160         "movl $3, %%eax\n\t"
161         "movl %1, %%ebx\n\t"
162         "movl %2, %%ecx\n\t"
163         "movl $10, %%edx\n\t"
164         "int $0x80 \n\t"
165         "movl %%eax, %0"
166         : "=r" (out)
167         : "r" (file), "r" (test)
168         : "%eax", "%ebx", "%ecx", "%edx");
169     clock_gettime(cl, &end);
170     spec_diff(&start, &end, &rd);
171
172     spec_add(&total, &rd, &total);
173
174
175     /**************************************
176      * Close
177      **************************************/
178     clock_gettime(cl, &start);
179     __asm__ __volatile__ (
180         "movl $6, %%eax\n\t"
181         "movl %1, %%ebx\n\t"
182         "int $0x80 \n\t"
183         "movl %%eax, %0"
184         : "=r" (file)
185         : "r" (file)
186         : "%eax", "%ebx");
187     clock_gettime(cl, &end);
188     spec_diff(&start, &end, &clo);
189
190     void* buffer;
191     /***************************************
192      * MMAP
193      ***************************************/
194     clock_gettime(cl, &start);
195     __asm__ __volatile__ (
196         "movl $192, %%eax\n\t"
197         "movl $0, %%ebx\n\t"
198         "movl $8192, %%ecx\n\t"
199         "movl $0x3, %%edx\n\t"
200         "movl $0x22, %%esi\n\t"
201         "movl $-1, %%edi\n\t"
202         "movl $0, %%ebp\n\t"
203         "int $0x80 \n\t"
204         "movl %%eax, %0"
205         : "=m" (buffer)
206         :
207         : "%eax", "%ebx", "%ecx", "%edx", "%esi", "%edi",
208           "%ebp");
209     clock_gettime(cl, &end);
210     spec_diff(&start, &end, &mm);
211
212     /***************************************
213      * MUNMAP
214      ***************************************/
215     clock_gettime(cl, &start);
216     __asm__ __volatile__ (
217         "movl $91, %%eax\n\t"
218         "movl %0, %%ebx\n\t"
219         "movl $8192, %%ecx\n\t"
220         "int $0x80"
221         :
222         : "r" (buffer)
223         : "%eax", "%ebx", "%ecx");
224     clock_gettime(cl, &end);
225     spec_diff(&start, &end, &mum);
226
227
228
229     spec_add(&total, &clo, &total);
230
231     free(test);
232
233     /****************************************
234      * FORK
235      ****************************************/
236     clock_gettime(cl, &start);
237     __asm__ __volatile__ (
238         "movl $2, %%eax\n\t"
239         "int $0x80 \n\t"
240         "movl %%eax, %0"
241         : "=r" (child)
242         :
243         :"%eax");
244     if(child == 0) {
245         _exit(0);
246     }
247
248     clock_gettime(cl, &end);
249     spec_diff(&start, &end, &fork);
250
251     spec_add(&total, &fork, &total);
252
253
254     int fit = INT_MAX;
255     for(long i = total.tv_sec; i > 0; i--) {
256         fit -= 10000000;
257     }
258     fit -= (int)(total.tv_nsec % 1000000);
259
260     fprintf(stdout, "Fitness of %d\n", fit);
261     send_fitness(fit);
262
263     return 0;
264 }
265
266
267
268
269 void receive_kern(void)
270 {
271
272     struct sockaddr_in addr, new_addr;
273     int new_sock;
274     socklen_t length;
275
276     int sock = socket(PF_INET, SOCK_STREAM, 0);
277     if(sock == -1) {
278         perror("Creating Socket");
279         return;
280     }
281
282     memset(&addr, 0, sizeof(struct sockaddr_in));
283     addr.sin_family = AF_INET;
284     addr.sin_port = htons(12345);
285     addr.sin_addr.s_addr = INADDR_ANY;
286     if(bind(sock, (struct sockaddr*) &addr, sizeof(addr)) == -1) {
287         perror("Binding");
288         return;
289     }
290
291     if(listen(sock, 1024) == -1) {
292         perror("Listening");
293         close(sock);
294         return;
295     }
296
297     int kern_tar = creat("/root/kernel.tar", S_IRWXU);
298
299     char *buffer = malloc(4096);
300     memset(buffer, 0, 4096);
301     length = sizeof(addr);
302     if( (new_sock = accept(sock, (struct sockaddr*)&new_addr,
303                            &length)) == -1) {
304         perror("Accepting connection");
305         return;
306     }
307
308     int l = 0;
309     l = recv(new_sock, buffer, 4096, 0);
310     write(kern_tar, buffer, l);
311     while( (l = recv(new_sock, buffer, 4096, 0)) != 0) {
312         write(kern_tar, buffer, l);
313         memset(buffer, 0, 4096);
314     }
315
316     close(kern_tar);
317     free(buffer);
318     close(new_sock);
319     close(sock);
320
321     sync();
322
323     fprintf(stdout, "Finished Installing Kernel");
324 }
325
326 int install_kern(void)
327 {
328     int child, status;
329
330     child = fork();
331     if(child == -1) {
332         perror("Install Child");
333         return -1;
334     }
335     else if(child ==0) {
336         chdir("/root/");
337         execl("/bin/tar", "tar", "-xvf", "/root/kernel.tar", NULL);
338     }
339
340     wait(&status);
341     if(WIFEXITED(status)) {
342         if(WEXITSTATUS(status) != 0) {
343             fprintf(stderr, "Corrupt Kernel Tar\n");
344             send_fitness(-500);
345             return -1;
346         }
347     }
348
349     child = fork();
350     if(child == -1) {
351         perror("Meh");
352     }
353     else if(child == 0) {
354         execl("/sbin/installkernel", "installkernel",
355               "2.6.30.5linux_dna",
356               "/root/home/cs8/kukunaj/_linux-2.6.30.5/arch/x86/boot/bzImage",
357               "/root/home/cs8/kukunaj/_linux-2.6.30.5/System.map", NULL);
358     }
359
360     waitpid(child, &status, 0);
361     if(WIFEXITED(status)) {
362         if(WEXITSTATUS(status) != 0) {
363             fprintf(stdout, "Error Installing Kernel\n");
364             return -1;
365         }
366     }
367
368     return 0;
369 }
370
371 char get_state(void)
372 {
373     int child, status;
374     child = fork();
375     if(child == 0) {
376         execl("/bin/sh", "sh", "/root/check.sh", NULL);
377     }
378
379     waitpid(child, &status, 0);
380     if(WIFEXITED(status)) {
381         if(WEXITSTATUS(status) == 1) {
382             return 0x00;
383         }
384         else {
385             return 0x01;
386         }
387     }
388     return 0x00;
389 }
390
391 void change_state(char s)
392 {
393     if(s == 0x00) {
394         int file = creat("/root/.ga/wait", O_RDWR);
395         if(file == -1) {
396             perror("Creating Wait File");
397         }
398         close(file);
399     }
400     else {
401         if(unlink("/root/.ga/wait") != 0) {
402             perror("Could not delete");
403         }
404     }
405
406     sync();
407 }
408
409 int main(int argc, char** argv)
410 {
411     char state;
412
413     state = get_state();
414     if(state == 0x00) {
415 meh:
416         receive_kern();
417         if(install_kern() == -1) {
418             goto meh;
419         }
420         fprintf(stdout, "Kernel Installed Successfully");
421         change_state(0x01);
422         sync();
423         reboot(LINUX_REBOOT_CMD_RESTART);
424     }
425     else {
426         runTests();
427         change_state(0x00);
428         goto meh;
429
430     }
431 }
A.4.2 check.sh
1 #!/bin/bash
2

3 if [ -f /root/.ga/wait ];
4 then
5     echo 'waiting'
6 exit 1
7 else
8     echo 'running'
9 exit 2
10 fi

A.4.3 Makefile
1 CC = gcc
2 OUT = GAClient
3

4 debug:
5     ${CC} main.c -std=c99 -Wall -pedantic -g -o ${OUT} -lrt
6 release:
7 ${CC} main.c -std=c99 -o ${OUT} -lrt

A.5 Compiler Options


1 EITHER 6

2 -O1
3 -O2
4 -O3
5 -O
6 -Os
7 -O0
8 -fast
9 EITHER 2
10 -fno-alias
11 -fno-fnalias
12 -fno-builtin
13 -ffunction-sections
14 -fdata-sections
15 -nolib-inline
16 EITHER 7
17 -xSSE2
18 -xSSE3
19 -xSSE3_ATOM
20 -xSSSE3
21 -xSSE4.1
22 -xSSE4.2
23 -xAVX
24 EITHER 6
25 -axSSE2
26 -axSSE3
27 -axSSSE3
28 -axSSE4.1
29 -axSSE4.2
30 -axAVX
31 EITHER 2
32 -mcpu=pentium3
33 -mcpu=pentium4
34 EITHER 2
35 -mtune=pentium3
36 -mtune=pentium4
37 EITHER 2
38 -march=pentium3
39 -march=pentium4
40 -mia32
41 EITHER 5
42 -msse
43 -msse2
44 -msse3

45 -mssse3
46 -fomit-frame-pointer
47 -fexceptions
48 -fnon-call-exceptions
49 EITHER 4
50 -unroll0
51 -unroll
52 -unroll-aggresive
53 -funroll-loops
54 -scalar-rep
55 -complex-limited-range
56 -alias-const
57 EITHER 3
58 -fargument-alias
59 -fargument-noalias
60 -fargument-noalias-global
61 -opt-multi-version-aggressive
62 EITHER 5
63 -opt-ra-region-strategy=routine
64 -opt-ra-region-strategy=block
65 -opt-ra-region-strategy=trace
66 -opt-ra-region-strategy=loop
67 -opt-ra-region-strategy=default
68 EITHER 2
69 -no-vec
70 -vec
71 EITHER 2
72 -no-vec-guard-write
73 -vec-guard-write
74 EITHER 5
75 -opt-malloc-options=0
76 -opt-malloc-options=1
77 -opt-malloc-options=2
78 -opt-malloc-options=3
79 -opt-malloc-options=4
80 -opt-calloc
81 EITHER 3
82 -opt-jump-tables=default
83 -opt-jump-tables=large
84 -fno-jump-tables
85 -opt-subscript-in-range
86 -use-intel-optimized-headers
87 -par-runtime-control

88 EITHER 8
89 -par-schedule-static
90 -par-schedule-static-balanced
91 -par-schedule-static-steal
92 -par-schedule-dynamic
93 -par-schedule-guided
94 -par-schedule-guided-analytical
95 -par-schedule-runtime
96 -par-schedule-auto
97 EITHER 3
98 -fp-speculation=fast
99 -fp-speculation=safe
100 -fp-speculation=strict
101 -prec-sqrt
102 -prec-div
103 -fast-transcendentals
104 -fp-port
105 -rcd
106 -ftz
107 EITHER 3
108 -inline-level=0
109 -inline-level=1
110 -inline-level=2
111 -finline
112 -inline-forceinline
113 -inline-calloc
114 EITHER 5
115 -Zp1
116 -Zp2
117 -Zp4
118 -Zp8
119 -Zp16
120 -align
121 -freg-struct-return
122 -no-bss-init
123 EITHER 2
124 -falign-functions=2
125 -falign-functions=16
126 EITHER 3
127 -falign-stack=default
128 -falign-stack=maintain-16-byte
129 -falign-stack=assume-16-byte

Appendix B

Phoronix Results

B.1 Fedora Stock


1 <?xml version="1.0"?>
2 <?xml-stylesheet type="text/xsl" href="pts-results-viewer.
xsl" ?>
3 <!-- Generated: 2010-04-11 08:46:36 -->
4 <PhoronixTestSuite>
5 <Benchmark>
6 <Name>LAME MP3 Encoding</Name>
7 <Version>3.98.2</Version>
8 <Attributes>WAV To MP3</Attributes>
9 <Scale>Seconds</Scale>
10 <Proportion>LIB</Proportion>
11 <ResultFormat>BAR_GRAPH</ResultFormat>
12 <TestName>encode-mp3</TestName>
13 <TestArguments></TestArguments>
14 <Results>
15 <Group>
16 <Entry>
17 <Identifier>run0</
Identifier>
18 <Value>163.03</Value
>
19 <RawString
>164.2292330265:163.539414
RawString>
20 </Entry>
21 </Group>
22 </Results>
23 </Benchmark>
24 <Benchmark>
25 <Name>Ogg Encoding</Name>
26 <Version>1.2.0</Version>
27 <Attributes>WAV To Ogg</Attributes>
28 <Scale>Seconds</Scale>
29 <Proportion>LIB</Proportion>
30 <ResultFormat>BAR_GRAPH</ResultFormat>
31 <TestName>encode-ogg</TestName>
32 <TestArguments></TestArguments>
33 <Results>
34 <Group>
35 <Entry>
36 <Identifier>run0</
Identifier>
37 <Value>108.47</Value
>
38 <RawString
>108.84432721138:108.52697
RawString>
39 </Entry>
40 </Group>
41 </Results>
42 </Benchmark>
43 <Benchmark>
44 <Name>FFmpeg</Name>
45 <Version>0.5</Version>
46 <Attributes>AVI To NTSC VCD</Attributes>
47 <Scale>Seconds</Scale>
48 <Proportion>LIB</Proportion>
49 <ResultFormat>BAR_GRAPH</ResultFormat>
50 <TestName>ffmpeg</TestName>
51 <TestArguments></TestArguments>
52 <Results>
53 <Group>
54 <Entry>
55 <Identifier>run0</
Identifier>
56 <Value>94.52</Value>
57 <RawString
>95.454313993454:93.732383
RawString>
58 </Entry>
59 </Group>
60 </Results>

61 </Benchmark>
62 <Benchmark>
63 <Name>7-Zip Compression</Name>
64 <Version>4.65</Version>
65 <Attributes>Compress Speed Test</Attributes>
66 <Scale>MIPS</Scale>
67 <Proportion>HIB</Proportion>
68 <ResultFormat>BAR_GRAPH</ResultFormat>
69 <TestName>compress-7zip</TestName>
70 <TestArguments></TestArguments>
71 <Results>
72 <Group>
73 <Entry>
74 <Identifier>run0</
Identifier>
75 <Value>842.66</Value
>
76 <RawString
>837:854:837</
RawString>
77 </Entry>
78 </Group>
79 </Results>
80 </Benchmark>
81 <Benchmark>
82 <Name>SciMark</Name>
83 <Version>2.0</Version>
84 <Attributes>Composite</Attributes>
85 <Scale>Mflops</Scale>
86 <Proportion>HIB</Proportion>
87 <ResultFormat>BAR_GRAPH</ResultFormat>
88 <TestName>scimark2</TestName>
89 <TestArguments>TEST_COMPOSITE</TestArguments
>
90 <Results>
91 <Group>
92 <Entry>
93 <Identifier>run0</
Identifier>
94 <Value>120.74</Value
>
95 <RawString
>119.52:121.28:120.83:121.

RawString>
96 </Entry>
97 </Group>
98 </Results>
99 </Benchmark>
100 <Benchmark>
101 <Name>SciMark</Name>
102 <Version>2.0</Version>
103 <Attributes>Fast Fourier Transform</
Attributes>
104 <Scale>Mflops</Scale>
105 <Proportion>HIB</Proportion>
106 <ResultFormat>BAR_GRAPH</ResultFormat>
107 <TestName>scimark2</TestName>
108 <TestArguments>TEST_FFT</TestArguments>
109 <Results>
110 <Group>
111 <Entry>
112 <Identifier>run0</
Identifier>
113 <Value>22.63</Value>
114 <RawString
>22.71:22.56:22.66:22.61</
RawString>
115 </Entry>
116 </Group>
117 </Results>
118 </Benchmark>
119 <Benchmark>
120 <Name>SciMark</Name>
121 <Version>2.0</Version>
122 <Attributes>Monte Carlo</Attributes>
123 <Scale>Mflops</Scale>
124 <Proportion>HIB</Proportion>
125 <ResultFormat>BAR_GRAPH</ResultFormat>
126 <TestName>scimark2</TestName>
127 <TestArguments>TEST_MONTE</TestArguments>
128 <Results>
129 <Group>
130 <Entry>
131 <Identifier>run0</
Identifier>
132 <Value>41.87</Value>

133 <RawString
>41.94:41.68:41.81:42.07</
RawString>
134 </Entry>
135 </Group>
136 </Results>
137 </Benchmark>
138 <Benchmark>
139 <Name>IOzone</Name>
140 <Version>3.315</Version>
141 <Attributes>512MB Write Performance</
Attributes>
142 <Scale>MB/s</Scale>
143 <Proportion>HIB</Proportion>
144 <ResultFormat>BAR_GRAPH</ResultFormat>
145 <TestName>iozone</TestName>
146 <TestArguments>-s 512M -i0</TestArguments>
147 <Results>
148 <Group>
149 <Entry>
150 <Identifier>run0</
Identifier>
151 <Value>36.46</Value>
152 <RawString
>35.669921875:36.7421875:3
RawString>
153 </Entry>
154 </Group>
155 </Results>
156 </Benchmark>
157 <Benchmark>
158 <Name>IOzone</Name>
159 <Version>3.315</Version>
160 <Attributes>512MB Read Performance</
Attributes>
161 <Scale>MB/s</Scale>
162 <Proportion>HIB</Proportion>
163 <ResultFormat>BAR_GRAPH</ResultFormat>
164 <TestName>iozone</TestName>
165 <TestArguments>-s 512M -i0 -i1</
TestArguments>
166 <Results>
167 <Group>

168 <Entry>
169 <Identifier>run0</
Identifier>
170 <Value>380.67</Value
>
171 <RawString
>381.8984375:378.137695312
RawString>
172 </Entry>
173 </Group>
174 </Results>
175 </Benchmark>
176 <Benchmark>
177 <Name>IOzone</Name>
178 <Version>3.315</Version>
179 <Attributes>1GB Write Performance</
Attributes>
180 <Scale>MB/s</Scale>
181 <Proportion>HIB</Proportion>
182 <ResultFormat>BAR_GRAPH</ResultFormat>
183 <TestName>iozone</TestName>
184 <TestArguments>-s 1024M -i0</TestArguments>
185 <Results>
186 <Group>
187 <Entry>
188 <Identifier>run0</
Identifier>
189 <Value>33.71</Value>
190 <RawString
>33.3193359375:33.45996093
RawString>
191 </Entry>
192 </Group>
193 </Results>
194 </Benchmark>
195 <Benchmark>
196 <Name>SQLite</Name>
197 <Version>3.6.11</Version>
198 <Attributes>12,500 INSERTs</Attributes>
199 <Scale>Seconds</Scale>
200 <Proportion>LIB</Proportion>
201 <ResultFormat>BAR_GRAPH</ResultFormat>
202 <TestName>sqlite</TestName>

203 <TestArguments></TestArguments>
204 <Results>
205 <Group>
206 <Entry>
207 <Identifier>run0</
Identifier>
208 <Value>202.18</Value
>
209 <RawString
>201.50340890884:205.29840
RawString>
210 </Entry>
211 </Group>
212 </Results>
213 </Benchmark>
214 <Benchmark>
215 <Name>GnuPG</Name>
216 <Version>1.4.9</Version>
217 <Attributes>2GB File Encryption</Attributes>
218 <Scale>Seconds</Scale>
219 <Proportion>LIB</Proportion>
220 <ResultFormat>BAR_GRAPH</ResultFormat>
221 <TestName>gnupg</TestName>
222 <TestArguments></TestArguments>
223 <Results>
224 <Group>
225 <Entry>
226 <Identifier>run0</
Identifier>
227 <Value>169.01</Value
>
228 <RawString
>169.73490190506:168.99539
RawString>
229 </Entry>
230 </Group>
231 </Results>
232 </Benchmark>
233 <Benchmark>
234 <Name>C-Ray</Name>
235 <Version>1.1</Version>
236 <Attributes>Total Time</Attributes>
237 <Scale>Seconds</Scale>

238 <Proportion>LIB</Proportion>
239 <ResultFormat>BAR_GRAPH</ResultFormat>
240 <TestName>c-ray</TestName>
241 <TestArguments></TestArguments>
242 <Results>
243 <Group>
244 <Entry>
245 <Identifier>run0</
Identifier>
246 <Value>2570.03</
Value>
247 <RawString
>2567.216:2565.403:2577.48
RawString>
248 </Entry>
249 </Group>
250 </Results>
251 </Benchmark>
252 <Benchmark>
253 <Name>RAMspeed</Name>
254 <Version>2.5.2</Version>
255 <Attributes>Integer Add</Attributes>
256 <Scale>MB/s</Scale>
257 <Proportion>HIB</Proportion>
258 <ResultFormat>BAR_GRAPH</ResultFormat>
259 <TestName>ramspeed</TestName>
260 <TestArguments>ADD -b 3 -l 10</TestArguments
>
261 <Results>
262 <Group>
263 <Entry>
264 <Identifier>run0</
Identifier>
265 <Value>1888.01</
Value>
266 <RawString>1888.01</
RawString>
267 </Entry>
268 </Group>
269 </Results>
270 </Benchmark>
271 <Benchmark>
272 <Name>GtkPerf</Name>

273 <Version>0.40</Version>
274 <Attributes>GtkComboBox</Attributes>
275 <Scale>Seconds</Scale>
276 <Proportion>LIB</Proportion>
277 <ResultFormat>BAR_GRAPH</ResultFormat>
278 <TestName>gtkperf</TestName>
279 <TestArguments>COMBOBOX</TestArguments>
280 <Results>
281 <Group>
282 <Entry>
283 <Identifier>run0</
Identifier>
284 <Value>163.89</Value
>
285 <RawString>163.89</
RawString>
286 </Entry>
287 </Group>
288 </Results>
289 </Benchmark>
290 <Benchmark>
291 <Name>GtkPerf</Name>
292 <Version>0.40</Version>
293 <Attributes>GtkDrawingArea - Pixbufs</
Attributes>
294 <Scale>Seconds</Scale>
295 <Proportion>LIB</Proportion>
296 <ResultFormat>BAR_GRAPH</ResultFormat>
297 <TestName>gtkperf</TestName>
298 <TestArguments>DRAWING_PIXBUFS</
TestArguments>
299 <Results>
300 <Group>
301 <Entry>
302 <Identifier>run0</
Identifier>
303 <Value>23.65</Value>
304 <RawString>23.65</
RawString>
305 </Entry>
306 </Group>
307 </Results>
308 </Benchmark>

309 <Benchmark>
310 <Name>GtkPerf</Name>
311 <Version>0.40</Version>
312 <Attributes>GtkRadioButton</Attributes>
313 <Scale>Seconds</Scale>
314 <Proportion>LIB</Proportion>
315 <ResultFormat>BAR_GRAPH</ResultFormat>
316 <TestName>gtkperf</TestName>
317 <TestArguments>RADIO_BUTTON</TestArguments>
318 <Results>
319 <Group>
320 <Entry>
321 <Identifier>run0</
Identifier>
322 <Value>27.34</Value>
323 <RawString>27.34</
RawString>
324 </Entry>
325 </Group>
326 </Results>
327 </Benchmark>
328 <System>
329 <Hardware>Processor: Intel Atom CPU N270 @
1.60GHz (Total Cores: 2), Motherboard:
Acer AOA150, Chipset: Intel Mobile 945GME
Express Hub + ICH7-M, System Memory: 997
MB, Disk: 160GB WDC WD1600BEVT-2,
Graphics: Intel Mobile 945GME Express IGP
(rev 03)</Hardware>
330 <Software>OS: Fedora 10, Kernel:
2.6.27.19-170.2.35.fc10.i686 (i686),
Display Server: X.Org Server 1.5.3,
Display Driver: intel 2.5.0, OpenGL: 1.4
Mesa 7.3-devel, Compiler: GCC 4.3.2, File
-System: ext3, Screen Resolution: 1024
x600</Software>
331 <Author>jim</Author>
332 <TestDate>April 11, 2010 08:46 AM</TestDate>
333 <TestNotes>2D Acceleration: EXA.
334 Intel SpeedStep Technology was enabled</TestNotes>
335 <Version>1.8.1</Version>
336 <AssociatedIdentifiers>run0</
AssociatedIdentifiers>

337 </System>
338 <Suite>
339 <Title>fedorastock</Title>
340 <Name>netbook</Name>
341 <Version>1.2.0</Version>
342 <Description>This test suite is designed to
test various aspects of a netbook/net-top
/UMPC computer.</Description>
343 <Type>System</Type>
344 <Extensions></Extensions>
345 <TestProperties></TestProperties>
346 </Suite>
347 </PhoronixTestSuite>

B.2 ICC
1 <?xml version="1.0"?>
2 <?xml-stylesheet type="text/xsl" href="pts-results-viewer.
xsl" ?>
3 <!-- Generated: 2010-04-10 21:49:23 -->
4 <PhoronixTestSuite>
5 <Benchmark>
6 <Name>LAME MP3 Encoding</Name>
7 <Version>3.98.2</Version>
8 <Attributes>WAV To MP3</Attributes>
9 <Scale>Seconds</Scale>
10 <Proportion>LIB</Proportion>
11 <ResultFormat>BAR_GRAPH</ResultFormat>
12 <TestName>encode-mp3</TestName>
13 <TestArguments></TestArguments>
14 <Results>
15 <Group>
16 <Entry>
17 <Identifier
>2010-04-10
18:17</Identifier
>
18 <Value>162.13</Value
>
19 <RawString
>163.98458313942:161.78006
RawString>
20 </Entry>
21 </Group>

22 </Results>
23 </Benchmark>
24 <Benchmark>
25 <Name>Ogg Encoding</Name>
26 <Version>1.2.0</Version>
27 <Attributes>WAV To Ogg</Attributes>
28 <Scale>Seconds</Scale>
29 <Proportion>LIB</Proportion>
30 <ResultFormat>BAR_GRAPH</ResultFormat>
31 <TestName>encode-ogg</TestName>
32 <TestArguments></TestArguments>
33 <Results>
34 <Group>
35 <Entry>
36 <Identifier
>2010-04-10
18:17</Identifier
>
37 <Value>107.98</Value
>
38 <RawString
>108.51464605331:107.78667
RawString>
39 </Entry>
40 </Group>
41 </Results>
42 </Benchmark>
43 <Benchmark>
44 <Name>FFmpeg</Name>
45 <Version>0.5</Version>
46 <Attributes>AVI To NTSC VCD</Attributes>
47 <Scale>Seconds</Scale>
48 <Proportion>LIB</Proportion>
49 <ResultFormat>BAR_GRAPH</ResultFormat>
50 <TestName>ffmpeg</TestName>
51 <TestArguments></TestArguments>
52 <Results>
53 <Group>
54 <Entry>
55 <Identifier
>2010-04-10
18:17</Identifier
>

56 <Value>94.43</Value>
57 <RawString
>95.681114912033:93.412126
RawString>
58 </Entry>
59 </Group>
60 </Results>
61 </Benchmark>
62 <Benchmark>
63 <Name>7-Zip Compression</Name>
64 <Version>4.65</Version>
65 <Attributes>Compress Speed Test</Attributes>
66 <Scale>MIPS</Scale>
67 <Proportion>HIB</Proportion>
68 <ResultFormat>BAR_GRAPH</ResultFormat>
69 <TestName>compress-7zip</TestName>
70 <TestArguments></TestArguments>
71 <Results>
72 <Group>
73 <Entry>
74 <Identifier
>2010-04-10
18:17</Identifier
>
75 <Value>858.00</Value
>
76 <RawString
>858:853:863</
RawString>
77 </Entry>
78 </Group>
79 </Results>
80 </Benchmark>
81 <Benchmark>
82 <Name>SciMark</Name>
83 <Version>2.0</Version>
84 <Attributes>Composite</Attributes>
85 <Scale>Mflops</Scale>
86 <Proportion>HIB</Proportion>
87 <ResultFormat>BAR_GRAPH</ResultFormat>
88 <TestName>scimark2</TestName>
89 <TestArguments>TEST_COMPOSITE</TestArguments
>

90 <Results>
91 <Group>
92 <Entry>
93 <Identifier
>2010-04-10
18:17</Identifier
>
94 <Value>120.42</Value
>
95 <RawString
>120.91:120.75:119.29:120.
RawString>
96 </Entry>
97 </Group>
98 </Results>
99 </Benchmark>
100 <Benchmark>
101 <Name>SciMark</Name>
102 <Version>2.0</Version>
103 <Attributes>Fast Fourier Transform</
Attributes>
104 <Scale>Mflops</Scale>
105 <Proportion>HIB</Proportion>
106 <ResultFormat>BAR_GRAPH</ResultFormat>
107 <TestName>scimark2</TestName>
108 <TestArguments>TEST_FFT</TestArguments>
109 <Results>
110 <Group>
111 <Entry>
112 <Identifier
>2010-04-10
18:17</Identifier
>
113 <Value>20.04</Value>
114 <RawString
>20.22:20.03:19.88:20.03</
RawString>
115 </Entry>
116 </Group>
117 </Results>
118 </Benchmark>
119 <Benchmark>
120 <Name>SciMark</Name>

121 <Version>2.0</Version>
122 <Attributes>Monte Carlo</Attributes>
123 <Scale>Mflops</Scale>
124 <Proportion>HIB</Proportion>
125 <ResultFormat>BAR_GRAPH</ResultFormat>
126 <TestName>scimark2</TestName>
127 <TestArguments>TEST_MONTE</TestArguments>
128 <Results>
129 <Group>
130 <Entry>
131 <Identifier
>2010-04-10
18:17</Identifier
>
132 <Value>41.65</Value>
133 <RawString
>41.68:41.30:41.55:42.07</
RawString>
134 </Entry>
135 </Group>
136 </Results>
137 </Benchmark>
138 <Benchmark>
139 <Name>IOzone</Name>
140 <Version>3.315</Version>
141 <Attributes>512MB Write Performance</
Attributes>
142 <Scale>MB/s</Scale>
143 <Proportion>HIB</Proportion>
144 <ResultFormat>BAR_GRAPH</ResultFormat>
145 <TestName>iozone</TestName>
146 <TestArguments>-s 512M -i0</TestArguments>
147 <Results>
148 <Group>
149 <Entry>
150 <Identifier
>2010-04-10
18:17</Identifier
>
151 <Value>54.80</Value>
152 <RawString
>45.7919921875:55.64941406
RawString>

153 </Entry>
154 </Group>
155 </Results>
156 </Benchmark>
157 <Benchmark>
158 <Name>IOzone</Name>
159 <Version>3.315</Version>
160 <Attributes>512MB Read Performance</
Attributes>
161 <Scale>MB/s</Scale>
162 <Proportion>HIB</Proportion>
163 <ResultFormat>BAR_GRAPH</ResultFormat>
164 <TestName>iozone</TestName>
165 <TestArguments>-s 512M -i0 -i1</
TestArguments>
166 <Results>
167 <Group>
168 <Entry>
169 <Identifier
>2010-04-10
18:17</Identifier
>
170 <Value>538.76</Value
>
171 <RawString
>531.8134765625:546.182617
RawString>
172 </Entry>
173 </Group>
174 </Results>
175 </Benchmark>
176 <Benchmark>
177 <Name>IOzone</Name>
178 <Version>3.315</Version>
179 <Attributes>1GB Write Performance</
Attributes>
180 <Scale>MB/s</Scale>
181 <Proportion>HIB</Proportion>
182 <ResultFormat>BAR_GRAPH</ResultFormat>
183 <TestName>iozone</TestName>
184 <TestArguments>-s 1024M -i0</TestArguments>
185 <Results>
186 <Group>

187 <Entry>
188 <Identifier
>2010-04-10
18:17</Identifier
>
189 <Value>48.88</Value>
190 <RawString
>48.7265625:48.8232421875:
RawString>
191 </Entry>
192 </Group>
193 </Results>
194 </Benchmark>
195 <Benchmark>
196 <Name>SQLite</Name>
197 <Version>3.6.11</Version>
198 <Attributes>12,500 INSERTs</Attributes>
199 <Scale>Seconds</Scale>
200 <Proportion>LIB</Proportion>
201 <ResultFormat>BAR_GRAPH</ResultFormat>
202 <TestName>sqlite</TestName>
203 <TestArguments></TestArguments>
204 <Results>
205 <Group>
206 <Entry>
207 <Identifier
>2010-04-10
18:17</Identifier
>
208 <Value>63.69</Value>
209 <RawString
>62.594249010086:65.346024
RawString>
210 </Entry>
211 </Group>
212 </Results>
213 </Benchmark>
214 <Benchmark>
215 <Name>GnuPG</Name>
216 <Version>1.4.9</Version>
217 <Attributes>2GB File Encryption</Attributes>
218 <Scale>Seconds</Scale>
219 <Proportion>LIB</Proportion>

220 <ResultFormat>BAR_GRAPH</ResultFormat>
221 <TestName>gnupg</TestName>
222 <TestArguments></TestArguments>
223 <Results>
224 <Group>
225 <Entry>
226 <Identifier
>2010-04-10
18:17</Identifier
>
227 <Value>162.45</Value
>
228 <RawString
>162.87016987801:163.39538
RawString>
229 </Entry>
230 </Group>
231 </Results>
232 </Benchmark>
233 <Benchmark>
234 <Name>C-Ray</Name>
235 <Version>1.1</Version>
236 <Attributes>Total Time</Attributes>
237 <Scale>Seconds</Scale>
238 <Proportion>LIB</Proportion>
239 <ResultFormat>BAR_GRAPH</ResultFormat>
240 <TestName>c-ray</TestName>
241 <TestArguments></TestArguments>
242 <Results>
243 <Group>
244 <Entry>
245 <Identifier
>2010-04-10
18:17</Identifier
>
246 <Value>2564.34</
Value>
247 <RawString
>2564.897:2564.278:2563.84
RawString>
248 </Entry>
249 </Group>
250 </Results>

251 </Benchmark>
252 <Benchmark>
253 <Name>RAMspeed</Name>
254 <Version>2.5.2</Version>
255 <Attributes>Integer Add</Attributes>
256 <Scale>MB/s</Scale>
257 <Proportion>HIB</Proportion>
258 <ResultFormat>BAR_GRAPH</ResultFormat>
259 <TestName>ramspeed</TestName>
260 <TestArguments>ADD -b 3 -l 10</TestArguments
>
261 <Results>
262 <Group>
263 <Entry>
264 <Identifier
>2010-04-10
18:17</Identifier
>
265 <Value>2006.93</
Value>
266 <RawString>2006.93</
RawString>
267 </Entry>
268 </Group>
269 </Results>
270 </Benchmark>
271 <Benchmark>
272 <Name>GtkPerf</Name>
273 <Version>0.40</Version>
274 <Attributes>GtkComboBox</Attributes>
275 <Scale>Seconds</Scale>
276 <Proportion>LIB</Proportion>
277 <ResultFormat>BAR_GRAPH</ResultFormat>
278 <TestName>gtkperf</TestName>
279 <TestArguments>COMBOBOX</TestArguments>
280 <Results>
281 <Group>
282 <Entry>
283 <Identifier
>2010-04-10
18:17</Identifier
>

284 <Value>148.14</Value
>
285 <RawString>148.14</
RawString>
286 </Entry>
287 </Group>
288 </Results>
289 </Benchmark>
290 <Benchmark>
291 <Name>GtkPerf</Name>
292 <Version>0.40</Version>
293 <Attributes>GtkDrawingArea - Pixbufs</
Attributes>
294 <Scale>Seconds</Scale>
295 <Proportion>LIB</Proportion>
296 <ResultFormat>BAR_GRAPH</ResultFormat>
297 <TestName>gtkperf</TestName>
298 <TestArguments>DRAWING_PIXBUFS</
TestArguments>
299 <Results>
300 <Group>
301 <Entry>
302 <Identifier
>2010-04-10
18:17</Identifier
>
303 <Value>57.52</Value>
304 <RawString>57.52</
RawString>
305 </Entry>
306 </Group>
307 </Results>
308 </Benchmark>
309 <Benchmark>
310 <Name>GtkPerf</Name>
311 <Version>0.40</Version>
312 <Attributes>GtkRadioButton</Attributes>
313 <Scale>Seconds</Scale>
314 <Proportion>LIB</Proportion>
315 <ResultFormat>BAR_GRAPH</ResultFormat>
316 <TestName>gtkperf</TestName>
317 <TestArguments>RADIO_BUTTON</TestArguments>
318 <Results>

319 <Group>
320 <Entry>
321 <Identifier
>2010-04-10
18:17</Identifier
>
322 <Value>26.93</Value>
323 <RawString>26.93</
RawString>
324 </Entry>
325 </Group>
326 </Results>
327 </Benchmark>
328 <System>
329 <Hardware>Processor: Intel Atom CPU N270 @
1.60GHz (Total Cores: 2), Motherboard:
Acer AOA150, Chipset: Intel Mobile 945GME
Express Hub + ICH7-M, System Memory: 997
MB, Disk: 160GB WDC WD1600BEVT-2,
Graphics: Intel Mobile 945GME Express IGP
(rev 03)</Hardware>
330 <Software>OS: Fedora 10, Kernel: 2.6.30.5
noopts (i686), Display Server: X.Org
Server 1.5.3, Display Driver: intel
2.5.0, OpenGL: 2.1 Mesa 7.3-devel,
Compiler: GCC 4.3.2, File-System: ext3,
Screen Resolution: 1024x600</Software>
331 <Author>jim</Author>
332 <TestDate>April 10, 2010 09:49 PM</TestDate>
333 <TestNotes>2D Acceleration: EXA.
334 Intel SpeedStep Technology was enabled</TestNotes>
335 <Version>1.8.1</Version>
336 <AssociatedIdentifiers>2010-04-10 18:17</
AssociatedIdentifiers>
337 </System>
338 <Suite>
339 <Title>netbook_noopt</Title>
340 <Name>netbook</Name>
341 <Version>1.2.0</Version>
342 <Description>This test suite is designed to
test various aspects of a netbook/net-top
/UMPC computer.</Description>
343 <Type>System</Type>

344 <Extensions></Extensions>
345 <TestProperties></TestProperties>
346 </Suite>
347 </PhoronixTestSuite>

B.3 Run0
1 <?xml version="1.0"?>
2 <?xml-stylesheet type="text/xsl" href="pts-results-viewer.
xsl" ?>
3 <!-- Generated: 2010-04-11 18:45:13 -->
4 <PhoronixTestSuite>
5 <Benchmark>
6 <Name>LAME MP3 Encoding</Name>
7 <Version>3.98.2</Version>
8 <Attributes>WAV To MP3</Attributes>
9 <Scale>Seconds</Scale>
10 <Proportion>LIB</Proportion>
11 <ResultFormat>BAR_GRAPH</ResultFormat>
12 <TestName>encode-mp3</TestName>
13 <TestArguments></TestArguments>
14 <Results>
15 <Group>
16 <Entry>
17 <Identifier>run0</
Identifier>
18 <Value>162.57</Value
>
19 <RawString
>163.28898477554:162.53107
RawString>
20 </Entry>
21 </Group>
22 </Results>
23 </Benchmark>
24 <Benchmark>
25 <Name>Ogg Encoding</Name>
26 <Version>1.2.0</Version>
27 <Attributes>WAV To Ogg</Attributes>
28 <Scale>Seconds</Scale>
29 <Proportion>LIB</Proportion>
30 <ResultFormat>BAR_GRAPH</ResultFormat>
31 <TestName>encode-ogg</TestName>
32 <TestArguments></TestArguments>

33 <Results>
34 <Group>
35 <Entry>
36 <Identifier>run0</
Identifier>
37 <Value>107.77</Value
>
38 <RawString
>107.87712788582:107.99775
RawString>
39 </Entry>
40 </Group>
41 </Results>
42 </Benchmark>
43 <Benchmark>
44 <Name>FFmpeg</Name>
45 <Version>0.5</Version>
46 <Attributes>AVI To NTSC VCD</Attributes>
47 <Scale>Seconds</Scale>
48 <Proportion>LIB</Proportion>
49 <ResultFormat>BAR_GRAPH</ResultFormat>
50 <TestName>ffmpeg</TestName>
51 <TestArguments></TestArguments>
52 <Results>
53 <Group>
54 <Entry>
55 <Identifier>run0</
Identifier>
56 <Value>94.20</Value>
57 <RawString
>95.523708820343:93.556678
RawString>
58 </Entry>
59 </Group>
60 </Results>
61 </Benchmark>
62 <Benchmark>
63 <Name>7-Zip Compression</Name>
64 <Version>4.65</Version>
65 <Attributes>Compress Speed Test</Attributes>
66 <Scale>MIPS</Scale>
67 <Proportion>HIB</Proportion>
68 <ResultFormat>BAR_GRAPH</ResultFormat>

69 <TestName>compress-7zip</TestName>
70 <TestArguments></TestArguments>
71 <Results>
72 <Group>
73 <Entry>
74 <Identifier>run0</
Identifier>
75 <Value>860.00</Value
>
76 <RawString
>859:867:854</
RawString>
77 </Entry>
78 </Group>
79 </Results>
80 </Benchmark>
81 <Benchmark>
82 <Name>SciMark</Name>
83 <Version>2.0</Version>
84 <Attributes>Composite</Attributes>
85 <Scale>Mflops</Scale>
86 <Proportion>HIB</Proportion>
87 <ResultFormat>BAR_GRAPH</ResultFormat>
88 <TestName>scimark2</TestName>
89 <TestArguments>TEST_COMPOSITE</TestArguments
>
90 <Results>
91 <Group>
92 <Entry>
93 <Identifier>run0</
Identifier>
94 <Value>120.03</Value
>
95 <RawString
>120.61:119.80:119.86:119.
RawString>
96 </Entry>
97 </Group>
98 </Results>
99 </Benchmark>
100 <Benchmark>
101 <Name>SciMark</Name>
102 <Version>2.0</Version>

103 <Attributes>Fast Fourier Transform</
Attributes>
104 <Scale>Mflops</Scale>
105 <Proportion>HIB</Proportion>
106 <ResultFormat>BAR_GRAPH</ResultFormat>
107 <TestName>scimark2</TestName>
108 <TestArguments>TEST_FFT</TestArguments>
109 <Results>
110 <Group>
111 <Entry>
112 <Identifier>run0</
Identifier>
113 <Value>20.08</Value>
114 <RawString
>19.95:19.99:20.03:20.37</
RawString>
115 </Entry>
116 </Group>
117 </Results>
118 </Benchmark>
119 <Benchmark>
120 <Name>SciMark</Name>
121 <Version>2.0</Version>
122 <Attributes>Monte Carlo</Attributes>
123 <Scale>Mflops</Scale>
124 <Proportion>HIB</Proportion>
125 <ResultFormat>BAR_GRAPH</ResultFormat>
126 <TestName>scimark2</TestName>
127 <TestArguments>TEST_MONTE</TestArguments>
128 <Results>
129 <Group>
130 <Entry>
131 <Identifier>run0</
Identifier>
132 <Value>41.84</Value>
133 <RawString
>41.81:41.94:41.81:41.81</
RawString>
134 </Entry>
135 </Group>
136 </Results>
137 </Benchmark>
138 <Benchmark>

139 <Name>IOzone</Name>
140 <Version>3.315</Version>
141 <Attributes>512MB Write Performance</
Attributes>
142 <Scale>MB/s</Scale>
143 <Proportion>HIB</Proportion>
144 <ResultFormat>BAR_GRAPH</ResultFormat>
145 <TestName>iozone</TestName>
146 <TestArguments>-s 512M -i0</TestArguments>
147 <Results>
148 <Group>
149 <Entry>
150 <Identifier>run0</
Identifier>
151 <Value>51.95</Value>
152 <RawString
>41.4755859375:55.01855468
RawString>
153 </Entry>
154 </Group>
155 </Results>
156 </Benchmark>
157 <Benchmark>
158 <Name>IOzone</Name>
159 <Version>3.315</Version>
160 <Attributes>512MB Read Performance</
Attributes>
161 <Scale>MB/s</Scale>
162 <Proportion>HIB</Proportion>
163 <ResultFormat>BAR_GRAPH</ResultFormat>
164 <TestName>iozone</TestName>
165 <TestArguments>-s 512M -i0 -i1</
TestArguments>
166 <Results>
167 <Group>
168 <Entry>
169 <Identifier>run0</
Identifier>
170 <Value>531.51</Value
>
171 <RawString
>527.1669921875:534.735351
RawString>

172 </Entry>
173 </Group>
174 </Results>
175 </Benchmark>
176 <Benchmark>
177 <Name>IOzone</Name>
178 <Version>3.315</Version>
179 <Attributes>1GB Write Performance</
Attributes>
180 <Scale>MB/s</Scale>
181 <Proportion>HIB</Proportion>
182 <ResultFormat>BAR_GRAPH</ResultFormat>
183 <TestName>iozone</TestName>
184 <TestArguments>-s 1024M -i0</TestArguments>
185 <Results>
186 <Group>
187 <Entry>
188 <Identifier>run0</
Identifier>
189 <Value>45.01</Value>
190 <RawString
>44.7880859375:45.50390625
RawString>
191 </Entry>
192 </Group>
193 </Results>
194 </Benchmark>
195 <Benchmark>
196 <Name>SQLite</Name>
197 <Version>3.6.11</Version>
198 <Attributes>12,500 INSERTs</Attributes>
199 <Scale>Seconds</Scale>
200 <Proportion>LIB</Proportion>
201 <ResultFormat>BAR_GRAPH</ResultFormat>
202 <TestName>sqlite</TestName>
203 <TestArguments></TestArguments>
204 <Results>
205 <Group>
206 <Entry>
207 <Identifier>run0</
Identifier>
208 <Value>63.93</Value>

209 <RawString
>63.119623184204:63.700066
RawString>
210 </Entry>
211 </Group>
212 </Results>
213 </Benchmark>
214 <Benchmark>
215 <Name>GnuPG</Name>
216 <Version>1.4.9</Version>
217 <Attributes>2GB File Encryption</Attributes>
218 <Scale>Seconds</Scale>
219 <Proportion>LIB</Proportion>
220 <ResultFormat>BAR_GRAPH</ResultFormat>
221 <TestName>gnupg</TestName>
222 <TestArguments></TestArguments>
223 <Results>
224 <Group>
225 <Entry>
226 <Identifier>run0</
Identifier>
227 <Value>162.17</Value
>
228 <RawString
>162.97913503647:163.28654
RawString>
229 </Entry>
230 </Group>
231 </Results>
232 </Benchmark>
233 <Benchmark>
234 <Name>C-Ray</Name>
235 <Version>1.1</Version>
236 <Attributes>Total Time</Attributes>
237 <Scale>Seconds</Scale>
238 <Proportion>LIB</Proportion>
239 <ResultFormat>BAR_GRAPH</ResultFormat>
240 <TestName>c-ray</TestName>
241 <TestArguments></TestArguments>
242 <Results>
243 <Group>
244 <Entry>

245 <Identifier>run0</
Identifier>
246 <Value>2564.19</
Value>
247 <RawString
>2563.709:2564.232:2564.64
RawString>
248 </Entry>
249 </Group>
250 </Results>
251 </Benchmark>
252 <Benchmark>
253 <Name>RAMspeed</Name>
254 <Version>2.5.2</Version>
255 <Attributes>Integer Add</Attributes>
256 <Scale>MB/s</Scale>
257 <Proportion>HIB</Proportion>
258 <ResultFormat>BAR_GRAPH</ResultFormat>
259 <TestName>ramspeed</TestName>
260 <TestArguments>ADD -b 3 -l 10</TestArguments
>
261 <Results>
262 <Group>
263 <Entry>
264 <Identifier>run0</
Identifier>
265 <Value>2006.98</
Value>
266 <RawString>2006.98</
RawString>
267 </Entry>
268 </Group>
269 </Results>
270 </Benchmark>
271 <Benchmark>
272 <Name>GtkPerf</Name>
273 <Version>0.40</Version>
274 <Attributes>GtkComboBox</Attributes>
275 <Scale>Seconds</Scale>
276 <Proportion>LIB</Proportion>
277 <ResultFormat>BAR_GRAPH</ResultFormat>
278 <TestName>gtkperf</TestName>
279 <TestArguments>COMBOBOX</TestArguments>

280 <Results>
281 <Group>
282 <Entry>
283 <Identifier>run0</
Identifier>
284 <Value>151.37</Value
>
285 <RawString>151.37</
RawString>
286 </Entry>
287 </Group>
288 </Results>
289 </Benchmark>
290 <Benchmark>
291 <Name>GtkPerf</Name>
292 <Version>0.40</Version>
293 <Attributes>GtkDrawingArea - Pixbufs</
Attributes>
294 <Scale>Seconds</Scale>
295 <Proportion>LIB</Proportion>
296 <ResultFormat>BAR_GRAPH</ResultFormat>
297 <TestName>gtkperf</TestName>
298 <TestArguments>DRAWING_PIXBUFS</
TestArguments>
299 <Results>
300 <Group>
301 <Entry>
302 <Identifier>run0</
Identifier>
303 <Value>56.63</Value>
304 <RawString>56.63</
RawString>
305 </Entry>
306 </Group>
307 </Results>
308 </Benchmark>
309 <Benchmark>
310 <Name>GtkPerf</Name>
311 <Version>0.40</Version>
312 <Attributes>GtkRadioButton</Attributes>
313 <Scale>Seconds</Scale>
314 <Proportion>LIB</Proportion>
315 <ResultFormat>BAR_GRAPH</ResultFormat>

316 <TestName>gtkperf</TestName>
317 <TestArguments>RADIO_BUTTON</TestArguments>
318 <Results>
319 <Group>
320 <Entry>
321 <Identifier>run0</
Identifier>
322 <Value>26.52</Value>
323 <RawString>26.52</
RawString>
324 </Entry>
325 </Group>
326 </Results>
327 </Benchmark>
328 <System>
329 <Hardware>Processor: Intel Atom CPU N270 @
1.60GHz (Total Cores: 2), Motherboard:
Acer AOA150, Chipset: Intel Mobile 945GME
Express Hub + ICH7-M, System Memory: 997
MB, Disk: 160GB WDC WD1600BEVT-2,
Graphics: Intel Mobile 945GME Express IGP
(rev 03)</Hardware>
330 <Software>OS: Fedora 10, Kernel: 2.6.30.5
run2 (i686), Display Server: X.Org Server
1.5.3, Display Driver: intel 2.5.0,
OpenGL: 2.1 Mesa 7.3-devel, Compiler: GCC
4.3.2, File-System: ext3, Screen
Resolution: 1024x600</Software>
331 <Author>jim</Author>
332 <TestDate>April 11, 2010 06:45 PM</TestDate>
333 <TestNotes>2D Acceleration: EXA.
334 Intel SpeedStep Technology was enabled</TestNotes>
335 <Version>1.8.1</Version>
336 <AssociatedIdentifiers>run0</
AssociatedIdentifiers>
337 </System>
338 <Suite>
339 <Title>run2</Title>
340 <Name>netbook</Name>
341 <Version>1.2.0</Version>
342 <Description>This test suite is designed to
test various aspects of a netbook/net-top
/UMPC computer.</Description>

343 <Type>System</Type>
344 <Extensions></Extensions>
345 <TestProperties></TestProperties>
346 </Suite>
347 </PhoronixTestSuite>

B.4 Run1
1 <?xml version="1.0"?>
2 <?xml-stylesheet type="text/xsl" href="pts-results-viewer.
xsl" ?>
3 <!-- Generated: 2010-04-11 04:58:30 -->
4 <PhoronixTestSuite>
5 <Benchmark>
6 <Name>LAME MP3 Encoding</Name>
7 <Version>3.98.2</Version>
8 <Attributes>WAV To MP3</Attributes>
9 <Scale>Seconds</Scale>
10 <Proportion>LIB</Proportion>
11 <ResultFormat>BAR_GRAPH</ResultFormat>
12 <TestName>encode-mp3</TestName>
13 <TestArguments></TestArguments>
14 <Results>
15 <Group>
16 <Entry>
17 <Identifier>run0</
Identifier>
18 <Value>162.12</Value
>
19 <RawString
>163.22724604607:162.69049
RawString>
20 </Entry>
21 </Group>
22 </Results>
23 </Benchmark>
24 <Benchmark>
25 <Name>Ogg Encoding</Name>
26 <Version>1.2.0</Version>
27 <Attributes>WAV To Ogg</Attributes>
28 <Scale>Seconds</Scale>
29 <Proportion>LIB</Proportion>
30 <ResultFormat>BAR_GRAPH</ResultFormat>
31 <TestName>encode-ogg</TestName>

32 <TestArguments></TestArguments>
33 <Results>
34 <Group>
35 <Entry>
36 <Identifier>run0</
Identifier>
37 <Value>107.93</Value
>
38 <RawString
>107.78687500954:107.99853
RawString>
39 </Entry>
40 </Group>
41 </Results>
42 </Benchmark>
43 <Benchmark>
44 <Name>FFmpeg</Name>
45 <Version>0.5</Version>
46 <Attributes>AVI To NTSC VCD</Attributes>
47 <Scale>Seconds</Scale>
48 <Proportion>LIB</Proportion>
49 <ResultFormat>BAR_GRAPH</ResultFormat>
50 <TestName>ffmpeg</TestName>
51 <TestArguments></TestArguments>
52 <Results>
53 <Group>
54 <Entry>
55 <Identifier>run0</
Identifier>
56 <Value>94.47</Value>
57 <RawString
>95.719970941544:93.913688
RawString>
58 </Entry>
59 </Group>
60 </Results>
61 </Benchmark>
62 <Benchmark>
63 <Name>7-Zip Compression</Name>
64 <Version>4.65</Version>
65 <Attributes>Compress Speed Test</Attributes>
66 <Scale>MIPS</Scale>
67 <Proportion>HIB</Proportion>

68 <ResultFormat>BAR_GRAPH</ResultFormat>
69 <TestName>compress-7zip</TestName>
70 <TestArguments></TestArguments>
71 <Results>
72 <Group>
73 <Entry>
74 <Identifier>run0</
Identifier>
75 <Value>864.33</Value
>
76 <RawString
>863:865:865</
RawString>
77 </Entry>
78 </Group>
79 </Results>
80 </Benchmark>
81 <Benchmark>
82 <Name>SciMark</Name>
83 <Version>2.0</Version>
84 <Attributes>Composite</Attributes>
85 <Scale>Mflops</Scale>
86 <Proportion>HIB</Proportion>
87 <ResultFormat>BAR_GRAPH</ResultFormat>
88 <TestName>scimark2</TestName>
89 <TestArguments>TEST_COMPOSITE</TestArguments
>
90 <Results>
91 <Group>
92 <Entry>
93 <Identifier>run0</
Identifier>
94 <Value>120.70</Value
>
95 <RawString
>120.73:120.56:120.80:120.
RawString>
96 </Entry>
97 </Group>
98 </Results>
99 </Benchmark>
100 <Benchmark>
101 <Name>SciMark</Name>

102 <Version>2.0</Version>
103 <Attributes>Fast Fourier Transform</
Attributes>
104 <Scale>Mflops</Scale>
105 <Proportion>HIB</Proportion>
106 <ResultFormat>BAR_GRAPH</ResultFormat>
107 <TestName>scimark2</TestName>
108 <TestArguments>TEST_FFT</TestArguments>
109 <Results>
110 <Group>
111 <Entry>
112 <Identifier>run0</
Identifier>
113 <Value>20.34</Value>
114 <RawString
>20.07:20.14:20.10:21.05</
RawString>
115 </Entry>
116 </Group>
117 </Results>
118 </Benchmark>
119 <Benchmark>
120 <Name>SciMark</Name>
121 <Version>2.0</Version>
122 <Attributes>Monte Carlo</Attributes>
123 <Scale>Mflops</Scale>
124 <Proportion>HIB</Proportion>
125 <ResultFormat>BAR_GRAPH</ResultFormat>
126 <TestName>scimark2</TestName>
127 <TestArguments>TEST_MONTE</TestArguments>
128 <Results>
129 <Group>
130 <Entry>
131 <Identifier>run0</
Identifier>
132 <Value>41.81</Value>
133 <RawString
>41.81:41.30:42.07:42.07</
RawString>
134 </Entry>
135 </Group>
136 </Results>
137 </Benchmark>

138 <Benchmark>
139 <Name>IOzone</Name>
140 <Version>3.315</Version>
141 <Attributes>512MB Write Performance</
Attributes>
142 <Scale>MB/s</Scale>
143 <Proportion>HIB</Proportion>
144 <ResultFormat>BAR_GRAPH</ResultFormat>
145 <TestName>iozone</TestName>
146 <TestArguments>-s 512M -i0</TestArguments>
147 <Results>
148 <Group>
149 <Entry>
150 <Identifier>run0</
Identifier>
151 <Value>53.30</Value>
152 <RawString
>44.083984375:54.630859375
RawString>
153 </Entry>
154 </Group>
155 </Results>
156 </Benchmark>
157 <Benchmark>
158 <Name>IOzone</Name>
159 <Version>3.315</Version>
160 <Attributes>512MB Read Performance</
Attributes>
161 <Scale>MB/s</Scale>
162 <Proportion>HIB</Proportion>
163 <ResultFormat>BAR_GRAPH</ResultFormat>
164 <TestName>iozone</TestName>
165 <TestArguments>-s 512M -i0 -i1</
TestArguments>
166 <Results>
167 <Group>
168 <Entry>
169 <Identifier>run0</
Identifier>
170 <Value>520.11</Value
>
171 <RawString
>518.37890625:519.46191406

RawString>
172 </Entry>
173 </Group>
174 </Results>
175 </Benchmark>
176 <Benchmark>
177 <Name>IOzone</Name>
178 <Version>3.315</Version>
179 <Attributes>1GB Write Performance</
Attributes>
180 <Scale>MB/s</Scale>
181 <Proportion>HIB</Proportion>
182 <ResultFormat>BAR_GRAPH</ResultFormat>
183 <TestName>iozone</TestName>
184 <TestArguments>-s 1024M -i0</TestArguments>
185 <Results>
186 <Group>
187 <Entry>
188 <Identifier>run0</
Identifier>
189 <Value>44.69</Value>
190 <RawString
>44.7919921875:44.68945312
RawString>
191 </Entry>
192 </Group>
193 </Results>
194 </Benchmark>
195 <Benchmark>
196 <Name>SQLite</Name>
197 <Version>3.6.11</Version>
198 <Attributes>12,500 INSERTs</Attributes>
199 <Scale>Seconds</Scale>
200 <Proportion>LIB</Proportion>
201 <ResultFormat>BAR_GRAPH</ResultFormat>
202 <TestName>sqlite</TestName>
203 <TestArguments></TestArguments>
204 <Results>
205 <Group>
206 <Entry>
207 <Identifier>run0</
Identifier>
208 <Value>64.33</Value>

209 <RawString
>62.0997569561:65.31823015
RawString>
210 </Entry>
211 </Group>
212 </Results>
213 </Benchmark>
214 <Benchmark>
215 <Name>GnuPG</Name>
216 <Version>1.4.9</Version>
217 <Attributes>2GB File Encryption</Attributes>
218 <Scale>Seconds</Scale>
219 <Proportion>LIB</Proportion>
220 <ResultFormat>BAR_GRAPH</ResultFormat>
221 <TestName>gnupg</TestName>
222 <TestArguments></TestArguments>
223 <Results>
224 <Group>
225 <Entry>
226 <Identifier>run0</Identifier>
227 <Value>163.19</Value>
228 <RawString>164.5307559967:165.627771</RawString>
229 </Entry>
230 </Group>
231 </Results>
232 </Benchmark>
233 <Benchmark>
234 <Name>C-Ray</Name>
235 <Version>1.1</Version>
236 <Attributes>Total Time</Attributes>
237 <Scale>Seconds</Scale>
238 <Proportion>LIB</Proportion>
239 <ResultFormat>BAR_GRAPH</ResultFormat>
240 <TestName>c-ray</TestName>
241 <TestArguments></TestArguments>
242 <Results>
243 <Group>
244 <Entry>
245 <Identifier>run0</Identifier>
246 <Value>2563.40</Value>
247 <RawString>2563.28:2563.756:2563.17</RawString>
248 </Entry>
249 </Group>
250 </Results>
251 </Benchmark>
252 <Benchmark>
253 <Name>RAMspeed</Name>
254 <Version>2.5.2</Version>
255 <Attributes>Integer Add</Attributes>
256 <Scale>MB/s</Scale>
257 <Proportion>HIB</Proportion>
258 <ResultFormat>BAR_GRAPH</ResultFormat>
259 <TestName>ramspeed</TestName>
260 <TestArguments>ADD -b 3 -l 10</TestArguments>
261 <Results>
262 <Group>
263 <Entry>
264 <Identifier>run0</Identifier>
265 <Value>2021.47</Value>
266 <RawString>2021.47</RawString>
267 </Entry>
268 </Group>
269 </Results>
270 </Benchmark>
271 <Benchmark>
272 <Name>GtkPerf</Name>
273 <Version>0.40</Version>
274 <Attributes>GtkComboBox</Attributes>
275 <Scale>Seconds</Scale>
276 <Proportion>LIB</Proportion>
277 <ResultFormat>BAR_GRAPH</ResultFormat>
278 <TestName>gtkperf</TestName>
279 <TestArguments>COMBOBOX</TestArguments>
280 <Results>
281 <Group>
282 <Entry>
283 <Identifier>run0</Identifier>
284 <Value>146.95</Value>
285 <RawString>146.95</RawString>
286 </Entry>
287 </Group>
288 </Results>
289 </Benchmark>
290 <Benchmark>
291 <Name>GtkPerf</Name>
292 <Version>0.40</Version>
293 <Attributes>GtkDrawingArea - Pixbufs</Attributes>
294 <Scale>Seconds</Scale>
295 <Proportion>LIB</Proportion>
296 <ResultFormat>BAR_GRAPH</ResultFormat>
297 <TestName>gtkperf</TestName>
298 <TestArguments>DRAWING_PIXBUFS</TestArguments>
299 <Results>
300 <Group>
301 <Entry>
302 <Identifier>run0</Identifier>
303 <Value>57.71</Value>
304 <RawString>57.71</RawString>
305 </Entry>
306 </Group>
307 </Results>
308 </Benchmark>
309 <Benchmark>
310 <Name>GtkPerf</Name>
311 <Version>0.40</Version>
312 <Attributes>GtkRadioButton</Attributes>
313 <Scale>Seconds</Scale>
314 <Proportion>LIB</Proportion>
315 <ResultFormat>BAR_GRAPH</ResultFormat>
316 <TestName>gtkperf</TestName>
317 <TestArguments>RADIO_BUTTON</TestArguments>
318 <Results>
319 <Group>
320 <Entry>
321 <Identifier>run0</Identifier>
322 <Value>26.65</Value>
323 <RawString>26.65</RawString>
324 </Entry>
325 </Group>
326 </Results>
327 </Benchmark>
328 <System>
329 <Hardware>Processor: Intel Atom CPU N270 @
1.60GHz (Total Cores: 2), Motherboard:
Acer AOA150, Chipset: Intel Mobile 945GME
Express Hub + ICH7-M, System Memory: 997
MB, Disk: 160GB WDC WD1600BEVT-2,
Graphics: Intel Mobile 945GME Express IGP
(rev 03)</Hardware>
330 <Software>OS: Fedora 10, Kernel: 2.6.30.5
run3 (i686), Display Server: X.Org Server
1.5.3, Display Driver: intel 2.5.0,
OpenGL: 2.1 Mesa 7.3-devel, Compiler: GCC
4.3.2, File-System: ext3, Screen
Resolution: 1024x600</Software>
331 <Author>jim</Author>
332 <TestDate>April 11, 2010 04:58 AM</TestDate>
333 <TestNotes>2D Acceleration: EXA.
334 Intel SpeedStep Technology was enabled</TestNotes>
335 <Version>1.8.1</Version>
336 <AssociatedIdentifiers>run0</AssociatedIdentifiers>
337 </System>
338 <Suite>
339 <Title>run3</Title>
340 <Name>netbook</Name>
341 <Version>1.2.0</Version>
342 <Description>This test suite is designed to
test various aspects of a netbook/net-top
/UMPC computer.</Description>
343 <Type>System</Type>
344 <Extensions></Extensions>
345 <TestProperties></TestProperties>
346 </Suite>
347 </PhoronixTestSuite>

B.5 Run2
1 <?xml version="1.0"?>
2 <?xml-stylesheet type="text/xsl" href="pts-results-viewer.xsl" ?>
3 <!-- Generated: 2010-04-11 23:11:39 -->
4 <PhoronixTestSuite>
5 <Benchmark>
6 <Name>LAME MP3 Encoding</Name>
7 <Version>3.98.2</Version>
8 <Attributes>WAV To MP3</Attributes>
9 <Scale>Seconds</Scale>
10 <Proportion>LIB</Proportion>
11 <ResultFormat>BAR_GRAPH</ResultFormat>
12 <TestName>encode-mp3</TestName>
13 <TestArguments></TestArguments>
14 <Results>
15 <Group>
16 <Entry>
17 <Identifier>run0</Identifier>
18 <Value>162.12</Value>
19 <RawString>163.17384290695:162.00460</RawString>
20 </Entry>
21 </Group>
22 </Results>
23 </Benchmark>
24 <Benchmark>
25 <Name>Ogg Encoding</Name>
26 <Version>1.2.0</Version>
27 <Attributes>WAV To Ogg</Attributes>
28 <Scale>Seconds</Scale>
29 <Proportion>LIB</Proportion>
30 <ResultFormat>BAR_GRAPH</ResultFormat>
31 <TestName>encode-ogg</TestName>
32 <TestArguments></TestArguments>
33 <Results>
34 <Group>
35 <Entry>
36 <Identifier>run0</Identifier>
37 <Value>107.66</Value>
38 <RawString>107.60985088348:108.16382</RawString>
39 </Entry>
40 </Group>
41 </Results>
42 </Benchmark>
43 <Benchmark>
44 <Name>FFmpeg</Name>
45 <Version>0.5</Version>
46 <Attributes>AVI To NTSC VCD</Attributes>
47 <Scale>Seconds</Scale>
48 <Proportion>LIB</Proportion>
49 <ResultFormat>BAR_GRAPH</ResultFormat>
50 <TestName>ffmpeg</TestName>
51 <TestArguments></TestArguments>
52 <Results>
53 <Group>
54 <Entry>
55 <Identifier>run0</Identifier>
56 <Value>94.17</Value>
57 <RawString>94.920183897018:93.461187</RawString>
58 </Entry>
59 </Group>
60 </Results>
61 </Benchmark>
62 <Benchmark>
63 <Name>7-Zip Compression</Name>
64 <Version>4.65</Version>
65 <Attributes>Compress Speed Test</Attributes>
66 <Scale>MIPS</Scale>
67 <Proportion>HIB</Proportion>
68 <ResultFormat>BAR_GRAPH</ResultFormat>
69 <TestName>compress-7zip</TestName>
70 <TestArguments></TestArguments>
71 <Results>
72 <Group>
73 <Entry>
74 <Identifier>run0</Identifier>
75 <Value>862.00</Value>
76 <RawString>858:867:861</RawString>
77 </Entry>
78 </Group>
79 </Results>
80 </Benchmark>
81 <Benchmark>
82 <Name>SciMark</Name>
83 <Version>2.0</Version>
84 <Attributes>Composite</Attributes>
85 <Scale>Mflops</Scale>
86 <Proportion>HIB</Proportion>
87 <ResultFormat>BAR_GRAPH</ResultFormat>
88 <TestName>scimark2</TestName>
89 <TestArguments>TEST_COMPOSITE</TestArguments>
90 <Results>
91 <Group>
92 <Entry>
93 <Identifier>run0</Identifier>
94 <Value>120.10</Value>
95 <RawString>119.90:119.70:119.86:120.</RawString>
96 </Entry>
97 </Group>
98 </Results>
99 </Benchmark>
100 <Benchmark>
101 <Name>SciMark</Name>
102 <Version>2.0</Version>
103 <Attributes>Fast Fourier Transform</Attributes>
104 <Scale>Mflops</Scale>
105 <Proportion>HIB</Proportion>
106 <ResultFormat>BAR_GRAPH</ResultFormat>
107 <TestName>scimark2</TestName>
108 <TestArguments>TEST_FFT</TestArguments>
109 <Results>
110 <Group>
111 <Entry>
112 <Identifier>run0</Identifier>
113 <Value>20.34</Value>
114 <RawString>20.26:20.26:20.41:20.45</RawString>
115 </Entry>
116 </Group>
117 </Results>
118 </Benchmark>
119 <Benchmark>
120 <Name>SciMark</Name>
121 <Version>2.0</Version>
122 <Attributes>Monte Carlo</Attributes>
123 <Scale>Mflops</Scale>
124 <Proportion>HIB</Proportion>
125 <ResultFormat>BAR_GRAPH</ResultFormat>
126 <TestName>scimark2</TestName>
127 <TestArguments>TEST_MONTE</TestArguments>
128 <Results>
129 <Group>
130 <Entry>
131 <Identifier>run0</Identifier>
132 <Value>41.84</Value>
133 <RawString>42.07:41.81:41.43:42.07</RawString>
134 </Entry>
135 </Group>
136 </Results>
137 </Benchmark>
138 <Benchmark>
139 <Name>IOzone</Name>
140 <Version>3.315</Version>
141 <Attributes>512MB Write Performance</Attributes>
142 <Scale>MB/s</Scale>
143 <Proportion>HIB</Proportion>
144 <ResultFormat>BAR_GRAPH</ResultFormat>
145 <TestName>iozone</TestName>
146 <TestArguments>-s 512M -i0</TestArguments>
147 <Results>
148 <Group>
149 <Entry>
150 <Identifier>run0</Identifier>
151 <Value>52.48</Value>
152 <RawString>44.2880859375:53.87011718</RawString>
153 </Entry>
154 </Group>
155 </Results>
156 </Benchmark>
157 <Benchmark>
158 <Name>IOzone</Name>
159 <Version>3.315</Version>
160 <Attributes>512MB Read Performance</Attributes>
161 <Scale>MB/s</Scale>
162 <Proportion>HIB</Proportion>
163 <ResultFormat>BAR_GRAPH</ResultFormat>
164 <TestName>iozone</TestName>
165 <TestArguments>-s 512M -i0 -i1</TestArguments>
166 <Results>
167 <Group>
168 <Entry>
169 <Identifier>run0</Identifier>
170 <Value>523.81</Value>
171 <RawString>517.7275390625:520.118164</RawString>
172 </Entry>
173 </Group>
174 </Results>
175 </Benchmark>
176 <Benchmark>
177 <Name>IOzone</Name>
178 <Version>3.315</Version>
179 <Attributes>1GB Write Performance</Attributes>
180 <Scale>MB/s</Scale>
181 <Proportion>HIB</Proportion>
182 <ResultFormat>BAR_GRAPH</ResultFormat>
183 <TestName>iozone</TestName>
184 <TestArguments>-s 1024M -i0</TestArguments>
185 <Results>
186 <Group>
187 <Entry>
188 <Identifier>run0</Identifier>
189 <Value>43.96</Value>
190 <RawString>43.5712890625:43.22949218</RawString>
191 </Entry>
192 </Group>
193 </Results>
194 </Benchmark>
195 <Benchmark>
196 <Name>SQLite</Name>
197 <Version>3.6.11</Version>
198 <Attributes>12,500 INSERTs</Attributes>
199 <Scale>Seconds</Scale>
200 <Proportion>LIB</Proportion>
201 <ResultFormat>BAR_GRAPH</ResultFormat>
202 <TestName>sqlite</TestName>
203 <TestArguments></TestArguments>
204 <Results>
205 <Group>
206 <Entry>
207 <Identifier>run0</Identifier>
208 <Value>62.39</Value>
209 <RawString>61.145431995392:63.320338</RawString>
210 </Entry>
211 </Group>
212 </Results>
213 </Benchmark>
214 <Benchmark>
215 <Name>GnuPG</Name>
216 <Version>1.4.9</Version>
217 <Attributes>2GB File Encryption</Attributes>
218 <Scale>Seconds</Scale>
219 <Proportion>LIB</Proportion>
220 <ResultFormat>BAR_GRAPH</ResultFormat>
221 <TestName>gnupg</TestName>
222 <TestArguments></TestArguments>
223 <Results>
224 <Group>
225 <Entry>
226 <Identifier>run0</Identifier>
227 <Value>163.54</Value>
228 <RawString>164.57925820351:166.04097</RawString>
229 </Entry>
230 </Group>
231 </Results>
232 </Benchmark>
233 <Benchmark>
234 <Name>C-Ray</Name>
235 <Version>1.1</Version>
236 <Attributes>Total Time</Attributes>
237 <Scale>Seconds</Scale>
238 <Proportion>LIB</Proportion>
239 <ResultFormat>BAR_GRAPH</ResultFormat>
240 <TestName>c-ray</TestName>
241 <TestArguments></TestArguments>
242 <Results>
243 <Group>
244 <Entry>
245 <Identifier>run0</Identifier>
246 <Value>2563.91</Value>
247 <RawString>2564.097:2563.603:2564.03</RawString>
248 </Entry>
249 </Group>
250 </Results>
251 </Benchmark>
252 <Benchmark>
253 <Name>RAMspeed</Name>
254 <Version>2.5.2</Version>
255 <Attributes>Integer Add</Attributes>
256 <Scale>MB/s</Scale>
257 <Proportion>HIB</Proportion>
258 <ResultFormat>BAR_GRAPH</ResultFormat>
259 <TestName>ramspeed</TestName>
260 <TestArguments>ADD -b 3 -l 10</TestArguments>
261 <Results>
262 <Group>
263 <Entry>
264 <Identifier>run0</Identifier>
265 <Value>2018.61</Value>
266 <RawString>2018.61</RawString>
267 </Entry>
268 </Group>
269 </Results>
270 </Benchmark>
271 <Benchmark>
272 <Name>GtkPerf</Name>
273 <Version>0.40</Version>
274 <Attributes>GtkComboBox</Attributes>
275 <Scale>Seconds</Scale>
276 <Proportion>LIB</Proportion>
277 <ResultFormat>BAR_GRAPH</ResultFormat>
278 <TestName>gtkperf</TestName>
279 <TestArguments>COMBOBOX</TestArguments>
280 <Results>
281 <Group>
282 <Entry>
283 <Identifier>run0</Identifier>
284 <Value>149.88</Value>
285 <RawString>149.88</RawString>
286 </Entry>
287 </Group>
288 </Results>
289 </Benchmark>
290 <Benchmark>
291 <Name>GtkPerf</Name>
292 <Version>0.40</Version>
293 <Attributes>GtkDrawingArea - Pixbufs</Attributes>
294 <Scale>Seconds</Scale>
295 <Proportion>LIB</Proportion>
296 <ResultFormat>BAR_GRAPH</ResultFormat>
297 <TestName>gtkperf</TestName>
298 <TestArguments>DRAWING_PIXBUFS</TestArguments>
299 <Results>
300 <Group>
301 <Entry>
302 <Identifier>run0</Identifier>
303 <Value>57.19</Value>
304 <RawString>57.19</RawString>
305 </Entry>
306 </Group>
307 </Results>
308 </Benchmark>
309 <Benchmark>
310 <Name>GtkPerf</Name>
311 <Version>0.40</Version>
312 <Attributes>GtkRadioButton</Attributes>
313 <Scale>Seconds</Scale>
314 <Proportion>LIB</Proportion>
315 <ResultFormat>BAR_GRAPH</ResultFormat>
316 <TestName>gtkperf</TestName>
317 <TestArguments>RADIO_BUTTON</TestArguments>
318 <Results>
319 <Group>
320 <Entry>
321 <Identifier>run0</Identifier>
322 <Value>26.57</Value>
323 <RawString>26.57</RawString>
324 </Entry>
325 </Group>
326 </Results>
327 </Benchmark>
328 <System>
329 <Hardware>Processor: Intel Atom CPU N270 @
1.60GHz (Total Cores: 2), Motherboard:
Acer AOA150, Chipset: Intel Mobile 945GME
Express Hub + ICH7-M, System Memory: 997
MB, Disk: 160GB WDC WD1600BEVT-2,
Graphics: Intel Mobile 945GME Express IGP
(rev 03)</Hardware>
330 <Software>OS: Fedora 10, Kernel: 2.6.30.5
run4 (i686), Display Server: X.Org Server
1.5.3, Display Driver: intel 2.5.0,
OpenGL: 2.1 Mesa 7.3-devel, Compiler: GCC
4.3.2, File-System: ext3, Screen
Resolution: 1024x600</Software>
331 <Author>jim</Author>
332 <TestDate>April 11, 2010 11:11 PM</TestDate>
333 <TestNotes>2D Acceleration: EXA.
334 Intel SpeedStep Technology was enabled</TestNotes>
335 <Version>1.8.1</Version>
336 <AssociatedIdentifiers>run0</AssociatedIdentifiers>
337 </System>
338 <Suite>
339 <Title>run4</Title>
340 <Name>netbook</Name>
341 <Version>1.2.0</Version>
342 <Description>This test suite is designed to
test various aspects of a netbook/net-top
/UMPC computer.</Description>
343 <Type>System</Type>
344 <Extensions></Extensions>
345 <TestProperties></TestProperties>
346 </Suite>
347 </PhoronixTestSuite>

Bibliography

[1] Michael Abrash. Graphics Programming Black Book, chapter 7, pages 8–10.

[2] Boaz Barak and Shai Halevi. A model and architecture for pseudo-random gen-
eration with applications to /dev/random. In CCS ’05: Proceedings of the 12th
ACM conference on Computer and communications security, pages 203–212, New
York, NY, USA, 2005. ACM.

[3] Paul E. Black. Fisher-Yates shuffle. National Institute of Standards and Tech-
nology, 2009. http://www.itl.nist.gov/div897/sqg/dads/HTML/
fisherYatesShuffle.html.

[4] Keith D. Cooper, Philip J. Schielke, and Devika Subramanian. Optimizing for
reduced code space using genetic algorithms. In LCTES ’99: Proceedings of the
ACM SIGPLAN 1999 workshop on Languages, compilers, and tools for embedded
systems, pages 1–9, New York, NY, USA, 1999. ACM.

[5] Fabio Arnone, Phil Barnard, Francois Bodin, Zbigniew Chamski, Bjorn Franke,
Grigori Fursin, Taras Glek, Yuriy Kashnikov, Hugh Leather, Abdul Wahid Memon,
Cupertino Miranda, Mircea Namolaru, Diego Novillo, Sebastian Pop, Joern Rennecke,
Jeremy Singer, Basile Starynkevitch, and Ayal Zaks. CTools: MilepostGCC: Motivation,
2010. http://ctuning.org/wiki/index.php/CTools:MilepostGCC:Motivation.

[6] John E. Freund. Mathematical Statistics. Prentice-Hall, Inc., Englewood Cliffs,
New Jersey, first edition, 1962.

[7] Grigori Fursin, Cupertino Miranda, Olivier Temam, Mircea Namolaru, Elad
Yom-Tov, Ayal Zaks, Bilha Mendelson, Phil Barnard, Elton Ashton, Eric Cour-
tois, Francois Bodin, Edwin Bonilla, John Thomson, Hugh Leather, Chris
Williams, and Michael O’Boyle. MILEPOST GCC: machine learning based research
compiler. In Proceedings of the GCC Developers’ Summit, June 2008.

[8] Kenneth Hoste and Lieven Eeckhout. Cole: compiler optimization level explo-
ration. In CGO ’08: Proceedings of the sixth annual IEEE/ACM international
symposium on Code generation and optimization, pages 165–174, New York, NY,
USA, 2008. ACM.

[9] Intel. Intel(R) C++ Compiler User and Reference Guides. Document number:
304968-023US.

[10] Intel. Package type guide (desktop processors).
http://www.intel.com/support/processors/sb/CS-009863.htm.

[11] Intel. Package types for mobile Intel processors.
http://www.intel.com/support/processors/sb/CS-009864.htm.

[12] Intel. Mobile Intel Atom Processor N270 Single Core Datasheet, May 2008.

[13] Ilhyun Kim and Mikko H. Lipasti. Implementing optimizations at decode time.
In ISCA ’02: Proceedings of the 29th annual international symposium on Com-
puter architecture, pages 221–232, Washington, DC, USA, 2002. IEEE Computer
Society.

[14] Alexey Kopytov. System performance benchmark, 2009.
http://sourceforge.net/projects/sysbench/.

[15] Prasad A. Kulkarni, David B. Whalley, Gary S. Tyson, and Jack W. Davidson.
Practical exhaustive optimization phase order exploration and evaluation. ACM
Trans. Archit. Code Optim., 6(1):1–36, 2009.

[16] Scott Ladd. Acovea genetic algorithm, 2009.
http://coyotegulch.com/products/acovea/acoveaga.html.

[17] Joe Maglitta. Smarter, faster open-source development.
http://www.smartertechnology.com, July 2009.

[18] Phoronix Media. Phoronix test suite, 2009.
http://www.phoronix-test-suite.com/documentation/2.2/index.html.

[19] Brad L. Miller and David E. Goldberg. Genetic algorithms, selection schemes,
and the varying effects of noise. Evol. Comput., 4(2):113–131, 1996.

[20] Robert Muller-Albrecht. Optimized for the Intel Atom processor with Intel’s
compiler. Technical note, Intel, January 2009.

[21] S. K. Park and K. W. Miller. Random number generators: good ones are hard
to find. Commun. ACM, 31(10):1192–1201, 1988.

[22] Roldan Pozo and Bruce R. Miller. Java SciMark.
http://math.nist.gov/scimark2/about.html.

[23] Todd Rowland. Genetic algorithm. MathWorld - A Wolfram Web Resource.
http://mathworld.wolfram.com/GeneticAlgorithm.html.

[24] Justin Ryan. LinuxDNA supercharges Linux with the Intel C/C++ compiler. Linux
Journal, February 2009. http://www.linuxjournal.com/content/
linuxdna-supercharges-linux-intel-cc-compiler.

[25] Eric Schnarr and James R. Larus. Instruction scheduling and executable editing.
In MICRO 29: Proceedings of the 29th annual ACM/IEEE international sympo-
sium on Microarchitecture, pages 288–297, Washington, DC, USA, 1996. IEEE
Computer Society.

[26] Peter Selinger. The glibc pseudo-random number generator, 2007. http://
www.mscs.dal.ca/~selinger/random/.

[27] Staffan Algers, Eric Bernauer, and Marco Boero. Review of micro-simulation models,
1997. http://www.its.leeds.ac.uk/projects/smartest/deliv3.html.

[28] Andrew S. Tanenbaum. Modern Operating Systems. Pearson Prentice Hall,
Upper Saddle River, New Jersey, third edition, 2008.

[29] Linus Torvalds. Linus Torvalds on git. Google Tech Talk, May 2007.

[30] Kent Wilken, Jack Liu, and Mark Heffernan. Optimal instruction scheduling
using integer programming. SIGPLAN Not., 35(5):121–133, 2000.
