
Understanding C by learning assembly

Last time, Alan showed how to use GDB as a tool to learn C. Today I want to go one step further and use GDB to help
us understand assembly as well.
Abstraction layers are great tools for building things, but they can sometimes get in the way of learning. My goal in this
post is to convince you that in order to rigorously understand C, we must also understand the assembly that our C
compiler generates. I'll do this by showing you how to disassemble and read a simple program with GDB, and then
we'll use GDB and our knowledge of assembly to understand how static local variables work in C.
Note: All the code in this post was compiled on an x86_64 CPU running Mac OS X 10.8.1 using Clang 4.0 with
optimizations disabled (-O0).

Learning assembly with GDB


Let's start by disassembling a program with GDB and learning how to read the output. Type the following program into
a text file and save it as simple.c:
int main() {
    int a = 5;
    int b = a + 6;
    return 0;
}

Now compile it with debugging symbols and no optimizations, and then run GDB [1]:
$ CFLAGS="-g -O0" make simple
cc -g -O0 simple.c -o simple
$ gdb simple

Inside GDB, we'll break on main and run until we get to the return statement. We put the number 2 after next to
specify that we want to run next twice:
(gdb) break main
(gdb) run
(gdb) next 2

Now let's use the disassemble command to show the assembly instructions for the current function. You can also
pass a function name to disassemble to specify a different function to examine.
(gdb) disassemble
Dump of assembler code for function main:
0x0000000100000f50 <main+0>: push %rbp
0x0000000100000f51 <main+1>: mov %rsp,%rbp
0x0000000100000f54 <main+4>: mov $0x0,%eax
0x0000000100000f59 <main+9>: movl $0x0,-0x4(%rbp)
0x0000000100000f60 <main+16>: movl $0x5,-0x8(%rbp)
0x0000000100000f67 <main+23>: mov -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>: add $0x6,%ecx
0x0000000100000f70 <main+32>: mov %ecx,-0xc(%rbp)
0x0000000100000f73 <main+35>: pop %rbp
0x0000000100000f74 <main+36>: retq
End of assembler dump.
The disassemble command defaults to outputting instructions in AT&T syntax, which is the same syntax used by
the GNU assembler [2]. Instructions in AT&T syntax are of the format mnemonic source, destination. The
mnemonic is a human readable name for the instruction. Source and destination are operands and can be immediate
values, registers, memory addresses, or labels. Immediate values are constants, and are prefixed by a $. For instance,
$0x5 represents the number 5 in hexadecimal. Register names are prefixed by a %.

Registers
It's worth taking a quick detour to understand registers. Registers are data storage locations directly on the CPU. With
some exceptions, the size, or width, of a CPU's registers defines its architecture. So if you have a 64-bit CPU, your
registers will be 64 bits wide. The same is true of 32-bit CPUs (32-bit registers), 16-bit CPUs, and so on [3]. Registers are
very fast to access and are often the operands for arithmetic and logic operations.
The x86 family has a number of general and special purpose registers. General purpose registers can be used for any
operation and their value has no particular meaning to the CPU. On the other hand, the CPU relies on special purpose
registers for its own operation and the values stored in them have a specific meaning depending on the register. In our
example above, %eax and %ecx are general purpose registers, while %rbp and %rsp are special purpose registers.
%rbp is the base pointer, which points to the base of the current stack frame, and %rsp is the stack pointer, which
points to the top of the current stack frame. %rbp always has a higher value than %rsp because the stack starts at a
high memory address and grows downwards. If you are unfamiliar with the call stack, you can find a good introduction
on Wikipedia.
One quirk of the x86 family is that it has maintained backwards compatibility all the way back to the 16-bit 8086
processor. As x86 moved from 16-bit to 32-bit to 64-bit, the registers were expanded and given new names so as to not
break backwards compatibility with code that was written for older, narrower CPUs.
Take the general purpose register AX, which is 16 bits wide. The high byte can be accessed with the name AH, and the
low byte with the name AL. When the 32-bit 80386 came out, the Extended AX register, or EAX, referred to the 32-bit
register, while AX continued to refer to a 16-bit register that made up the lower half of EAX. Similarly, when the
x86_64 architecture came out, the R prefix was used and EAX made up the lower half of the 64-bit RAX register.
I've included a diagram below based on a Wikipedia article to help visualize the relationships I described:
|__64__|__56__|__48__|__40__|__32__|__24__|__16__|__8___|
|__________________________RAX__________________________|
|xxxxxxxxxxxxxxxxxxxxxxxxxxx|____________EAX____________|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|_____AX______|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|__AH__|__AL__|

Back to the code


This should be enough information to start reading our disassembled program.
0x0000000100000f50 <main+0>: push %rbp
0x0000000100000f51 <main+1>: mov %rsp,%rbp

The first two instructions are called the function prologue or preamble. First we push the old base pointer onto the stack
to save it for later. Then we copy the value of the stack pointer to the base pointer. After this, %rbp points to the base of
main's stack frame.
0x0000000100000f54 <main+4>: mov $0x0,%eax

This instruction copies 0 into %eax. The x86 calling convention dictates that a function's return value is stored in
%eax, so the above instruction sets us up to return 0 at the end of our function.
0x0000000100000f59 <main+9>: movl $0x0,-0x4(%rbp)

Here we have something we haven't encountered before: -0x4(%rbp). The parentheses let us know that this is a
memory address. Here, %rbp is called the base register, and -0x4 is the displacement. This is equivalent to %rbp +
-0x4. Because the stack grows downwards, subtracting 4 from the base of the current stack frame moves us into the
current frame itself, where local variables are stored. This means that this instruction stores 0 at %rbp - 4. It took me
a while to figure out what this line was for, but it seems that clang allocates a hidden local variable for an implicit
return value from main.

You'll also notice that the mnemonic has the suffix l. This signifies that the operands will be long (32 bits for
integers). Other valid suffixes are byte, short, word, quad, and ten. If you see an instruction that does not have a
suffix, the size of the operands is inferred from the size of the source or destination register. For instance, in the
previous line, %eax is 32 bits wide, so the mov instruction is inferred to be movl.
0x0000000100000f60 <main+16>: movl $0x5,-0x8(%rbp)

Now we're getting into the meat of our sample program! This line of assembly corresponds to the first line of C in main:
it stores the number 5 in the next available local variable slot (%rbp - 0x8), 4 bytes down from our last local variable.
That's the location of a. We can use GDB to verify this:
(gdb) x &a
0x7fff5fbff768: 0x00000005
(gdb) x $rbp - 8
0x7fff5fbff768: 0x00000005

Note that the memory addresses are the same. You'll notice that GDB sets up variables for our registers, but like all
variables in GDB, we prefix them with a $ rather than the % used in AT&T assembly.
0x0000000100000f67 <main+23>: mov -0x8(%rbp),%ecx
0x0000000100000f6a <main+26>: add $0x6,%ecx
0x0000000100000f70 <main+32>: mov %ecx,-0xc(%rbp)

We then move a into %ecx, one of our general purpose registers, add 6 to it and store the result in %rbp - 0xc. This
is the second line of C in main. You've probably figured out that %rbp - 0xc is b, which we can verify in GDB:
(gdb) x &b
0x7fff5fbff764: 0x0000000b
(gdb) x $rbp - 0xc
0x7fff5fbff764: 0x0000000b

The rest of main is just cleanup, called the function epilogue:


0x0000000100000f73 <main+35>: pop %rbp
0x0000000100000f74 <main+36>: retq

We pop the old base pointer off the stack and store it back in %rbp and then retq jumps back to our return address,
which is also stored in the stack frame.
So far we've used GDB to disassemble a short C program, gone over how to read AT&T assembly syntax, and covered
registers and memory address operands. We've also used GDB to verify where our local variables are stored in relation
to %rbp. Now we're going to use our newly acquired skills to explain how static local variables work.

Understanding static local variables

Static local variables are a very cool feature of C. In a nutshell, they are local variables that only get initialized once
and persist their values across multiple calls to the function where they are defined. A simple use case for static local
variables is a Python-style generator. Here's one that generates all of the natural numbers up to INT_MAX:
/* static.c */
#include <stdio.h>

int natural_generator() {
    int a = 1;
    static int b = -1;
    b += 1;
    return a + b;
}

int main() {
    printf("%d\n", natural_generator());
    printf("%d\n", natural_generator());
    printf("%d\n", natural_generator());

    return 0;
}

When compiled and run, this program prints the first three natural numbers:
$ CFLAGS="-g -O0" make static
cc -g -O0 static.c -o static
$ ./static
1
2
3

But how does this work? To understand static locals, we're going to jump into GDB and look at the assembly. I've
removed the address information that GDB adds to the disassembly so that everything fits on screen:
$ gdb static
(gdb) break natural_generator
(gdb) run
(gdb) disassemble
Dump of assembler code for function natural_generator:
push %rbp
mov %rsp,%rbp
movl $0x1,-0x4(%rbp)
mov 0x177(%rip),%eax # 0x100001018 <natural_generator.b>
add $0x1,%eax
mov %eax,0x16c(%rip) # 0x100001018 <natural_generator.b>
mov -0x4(%rbp),%eax
add 0x163(%rip),%eax # 0x100001018 <natural_generator.b>
pop %rbp
retq
End of assembler dump.

The first thing we need to do is figure out what instruction we're on. We can do that by examining the instruction
pointer or program counter. The instruction pointer is a register that stores the memory address of the next instruction.
On x86_64, that register is %rip. We can access the instruction pointer using the $rip variable, or alternatively we
can use the architecture independent $pc:
(gdb) x/i $pc
0x100000e94 <natural_generator+4>: movl $0x1,-0x4(%rbp)

The instruction pointer always contains the address of the next instruction to be run, which means the third instruction
hasn't been run yet, but is about to be.
Because knowing the next instruction is useful, we're going to make GDB show us the next instruction every time the
program stops. In GDB 7.0 or later, you can just run set disassemble-next-line on, which shows all the
instructions that make up the next line of source, but we're using Mac OS X, which only ships with GDB 6.3, so we'll
have to resort to the display command. display is like x, except it evaluates its expression every time our
program stops:
(gdb) display/i $pc
1: x/i $pc 0x100000e94 <natural_generator+4>: movl $0x1,-0x4(%rbp)

Now GDB is set up to always show us the next instruction before showing its prompt.
We're already past the function prologue, which we covered earlier, so we'll start right at the third instruction. This
corresponds to the first source line that assigns 1 to a. Instead of next, which moves to the next source line, we'll use
nexti, which moves to the next assembly instruction. Afterwards we'll examine %rbp - 0x4 to verify our
hypothesis that a is stored at %rbp - 0x4.
(gdb) nexti
7 b += 1;
1: x/i $pc mov 0x177(%rip),%eax # 0x100001018 <natural_generator.b>
(gdb) x $rbp - 0x4
0x7fff5fbff78c: 0x00000001
(gdb) x &a
0x7fff5fbff78c: 0x00000001

They are the same, just as we expected. The next instruction is more interesting:
mov 0x177(%rip),%eax # 0x100001018 <natural_generator.b>

This is where we'd expect to find the line static int b = -1;, but it looks substantially different than anything
we've seen before. For one thing, there's no reference to the stack frame where we'd normally expect to find local
variables. There's not even a -0x1! Instead, we have an instruction that loads the value stored at address 0x100001018,
located somewhere after the instruction pointer, into %eax. GDB gives us a helpful comment with the result of the memory
operand calculation and a hint telling us that natural_generator.b is stored at this address. Let's run this instruction and
figure out what's going on:
(gdb) nexti
(gdb) p $rax
$3 = 4294967295
(gdb) p/x $rax
$5 = 0xffffffff

Even though the disassembly shows %eax as the destination, we print $rax, because GDB only sets up variables for
full width registers.
In this situation, we need to remember that while variables have types that specify if they are signed or unsigned,
registers don't, so GDB is printing the value of %rax unsigned. Let's try again, by casting %rax to a signed int:
(gdb) p (int)$rax
$11 = -1

It looks like we've found b. We can double check this by using the x command:
(gdb) x/d 0x100001018
0x100001018 <natural_generator.b>: -1
(gdb) x/d &b
0x100001018 <natural_generator.b>: -1
So not only is b stored at a low memory address outside of the stack, it's also initialized to -1 before
natural_generator is even called. In fact, even if you disassembled the entire program, you wouldn't find any
code that sets b to -1. This is because the value for b is hardcoded in a different section of the sample executable, and
it's loaded into memory along with all the machine code by the operating system's loader when the process is
launched [4].
With this out of the way, things start to make more sense. After storing b in %eax, we move to the next line of source
where we increment b. This corresponds to the next two instructions:
add $0x1,%eax
mov %eax,0x16c(%rip) # 0x100001018 <natural_generator.b>

Here we add 1 to %eax and store the result back into memory. Let's run these instructions and verify the result:
(gdb) nexti 2
(gdb) x/d &b
0x100001018 <natural_generator.b>: 0
(gdb) p (int)$rax
$15 = 0

The next two instructions set us up to return a + b:


mov -0x4(%rbp),%eax
add 0x163(%rip),%eax # 0x100001018 <natural_generator.b>

Here we load a into %eax and then add b. At this point, we'd expect %eax to be 1. Let's verify:
(gdb) nexti 2
(gdb) p $rax
$16 = 1

%eax is used to store the return value from natural_generator, so we're all set up for the epilogue which cleans
up the stack and returns:
pop %rbp
retq

Now that we understand how b is initialized, let's see what happens when we run natural_generator again:
(gdb) continue
Continuing.
1

Breakpoint 1, natural_generator () at static.c:5


5 int a = 1;
1: x/i $pc 0x100000e94 <natural_generator+4>: movl $0x1,-0x4(%rbp)
(gdb) x &b
0x100001018 <natural_generator.b>: 0

Because b is not stored on the stack with other local variables, it's still zero when natural_generator is called
again. No matter how many times our generator is called, b will always retain its previous value. This is because it's
stored outside the stack and initialized when the loader moves the program into memory, rather than by any of our
machine code.

Conclusion
We began by going over how to read assembly and how to disassemble a program with GDB. Afterwards, we covered
how static local variables work, which we could not have done without disassembling our executable.
We spent a lot of time alternating between reading the assembly instructions and verifying our hypotheses in GDB. It
may seem repetitive, but there's a very important reason for doing things this way: the best way to learn something
abstract is to make it more concrete, and one of the best ways to make something more concrete is to use tools that let
you peel back layers of abstraction. The best way to learn these tools is to force yourself to use them until they're
second nature.

1. You'll notice we're using Make to build simple.c without a makefile. We can do this because Make has
implicit rules for building executables from C files. You can find more information about these rules in the
Make manual (http://www.gnu.org/software/make/manual/make.html#Implicit-Rules).
2. You can also have GDB output Intel syntax, which is used by NASM, MASM, and other assemblers, but that's
outside the scope of this post.
3. Processors with SIMD instruction sets like MMX and SSE for x86 and AltiVec for PowerPC will often contain
some registers that are wider than the CPU architecture.
4. A discussion of object formats, loaders, and linkers is best saved for a future blog post.

Prerequisites for this tutorial

NASM
MinGW
MS Windows

Tutorial introduction
This tutorial is a basic demonstration of calling routines written in assembly from your C code. We will use
NASM to assemble our assembly code and GCC to compile our C code.

The assembly code


For the purpose of this tutorial, our assembly routine will be very simple: it adds two integers passed as
parameters and returns the result.

add.asm
; make the add function visible to the linker
global _add

; prototype: int __cdecl add(int a, int b)
; desc: adds two integers and returns the result
_add:
    mov eax, [esp+4] ; get the 1st parameter off the stack
    mov edx, [esp+8] ; get the 2nd parameter off the stack
    add eax, edx     ; add the parameters, return value in eax
    ret              ; return from sub-routine

We are using the __cdecl calling convention (the default for C/C++), so we must take a few things into account:
We leave the stack pointer unchanged inside the routine, because under __cdecl the stack is cleaned up by the
caller.
The return value is returned in the EAX register.

Leading underscores are added because, with MinGW on Windows, GCC prefixes C symbol names with an underscore by
default. We could use GCC compiler flags to change this behaviour, but for simplicity's sake we will just add them here.

The C code
The C code is just a basic main function that calls our add function.

main.c
#include <stdio.h>

/*
 * declaring add as extern tells the compiler that the definition
 * can be found in a separate module
 */
extern int add(int a, int b);

int main() {
    int ret = add(10, 20);
    printf("add returned %d\n", ret);
    return 0;
}

Compilation
Compiling the above code requires the use of object files. The format that we will use is ELF, since NASM and GCC
both understand this format.

The first thing we must do is assemble our assembly code using NASM. We will turn add.asm into an object file
called add.o using the following command.
nasm -f elf -o add.o add.asm

Now you should have an ELF file called add.o in your working directory. Next we need to compile our C code, main.c.
We use GCC to do this using the following command.
gcc -c main.c -o main.o

Now you should have two object files, add.o and main.o. Now for the final step, we will link these object files together
to create our final executable file. To do this, we will use GCC again which in turn will invoke LD for us. The
following command will create our executable.
gcc -o test_asm add.o main.o

Now you should have an executable file test_asm.exe, which will call our assembly routine and display the result:
add returned 30

Conclusion
Thanks for reading my tutorial, I hope you got something out of it.

Ryan

For those of you compiling on a *nix box, you'll need to add the underscores yourself for compilation:
#include <stdio.h>

/*
 * declaring add as extern tells the compiler that the definition
 * can be found in a separate module
 */
extern int _add(int a, int b);

int main() {
    int ret = _add(10, 20);
    printf("add returned %d\n", ret);
    return 0;
}

otherwise you'll get undefined errors for add.

Alternatively, modify the assembly and remove the underscores:


; make the add function visible to the linker
global add

; prototype: int __cdecl add(int a, int b)
; desc: adds two integers and returns the result
add:
    mov eax, [esp+4] ; get the 1st parameter off the stack
    mov edx, [esp+8] ; get the 2nd parameter off the stack
    add eax, edx     ; add the parameters, return value in eax
    ret              ; return from sub-routine

nasm 2.07
gcc 4.4.3
