
Term paper

Disassemblers

NAME- Jyoti

CLASS- B.Tech C.S.C

ROLLNO- A15

REG.NO-10901555

SUBMITTED TO - Er. Harjit Singh


Acknowledgement
As usual, a large number of people deserve my thanks for the help they provided
in the preparation of this term paper.

First of all, I would like to thank my teacher, Er. Harjit Singh, for the support
and guidance given during the preparation of this topic.

I am also thankful for the information about the topic that was provided to me
during my effort to prepare this paper.

INTRODUCTION

A disassembler is a computer program that translates machine language into
assembly language, the inverse operation to that of an assembler. A disassembler
differs from a decompiler, which targets a high-level language rather than an
assembly language. Disassembly, the output of a disassembler, is often formatted
for human readability rather than suitability for input to an assembler, making it
principally a reverse-engineering tool.

Put another way, a disassembler is a computer program that examines another
computer program and attempts to generate assembly language source code that
would, in theory, reproduce the target program.

The word is also used in nanotechnology for a system of nanomachines able to take
an object apart a few atoms at a time, while recording its structure at the
molecular level. This paper deals only with the software sense.

Assembly language source code generally permits the use of symbolic constants and
programmer comments. These are usually removed from the assembled machine
code by the assembler. If so, a disassembler operating on the machine code
produces disassembly lacking these constants and comments, and the disassembled
output becomes more difficult for a human to interpret than the original annotated
source code. Some disassemblers make use of the symbolic debugging information
present in object files such as ELF. The Interactive Disassembler allows the human
user to make up mnemonic symbols for values or regions of code in an interactive
session: human insight applied to the disassembly process often parallels human
creativity in the code writing process.

Disassembly is not an exact science: on CISC platforms with variable-width
instructions, or in the presence of self-modifying code, it is possible for a single
program to have two or more reasonable disassemblies. Determining which
instructions would actually be encountered during a run of the program is, in
general, as hard as the halting problem, which is proven to be unsolvable.
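
The ambiguity caused by variable-width instructions can be illustrated with a
small sketch. The three-opcode instruction set below is entirely hypothetical (it
is not x86 or any other real encoding), and the names decode_one() and sweep() are
made up for this illustration. It only shows that the same bytes give different
listings depending on the offset at which decoding starts.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical variable-width ISA, invented for this illustration:
 *   0x01          NOP          (1 byte)
 *   0x02 imm8     PUSH imm8    (2 bytes)
 *   0x03 lo hi    JMP  imm16   (3 bytes, little-endian operand)
 * Any other byte is emitted as raw data (.BYTE). */
size_t decode_one(const uint8_t *p, size_t avail, char *out, size_t outsz)
{
    if (p[0] == 0x01) {
        snprintf(out, outsz, "NOP");
        return 1;
    }
    if (p[0] == 0x02 && avail >= 2) {
        snprintf(out, outsz, "PUSH 0x%02X", (unsigned)p[1]);
        return 2;
    }
    if (p[0] == 0x03 && avail >= 3) {
        snprintf(out, outsz, "JMP 0x%04X", (unsigned)(p[1] | (p[2] << 8)));
        return 3;
    }
    snprintf(out, outsz, ".BYTE 0x%02X", (unsigned)p[0]);
    return 1;
}

/* Linear sweep starting at byte offset `start`; appends one line per
 * decoded instruction to `listing` (listing_sz must be at least 1). */
void sweep(const uint8_t *code, size_t len, size_t start,
           char *listing, size_t listing_sz)
{
    listing[0] = '\0';
    for (size_t i = start; i < len; ) {
        char line[32];
        i += decode_one(code + i, len - i, line, sizeof line);
        strncat(listing, line, listing_sz - strlen(listing) - 1);
        strncat(listing, "\n", listing_sz - strlen(listing) - 1);
    }
}
```

With the bytes 02 03 01 01, a sweep from offset 0 yields PUSH 0x03, NOP, NOP,
while a sweep from offset 1 yields the single instruction JMP 0x0101. Both are
reasonable readings of the same bytes.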

Types of disassemblers

Any interactive debugger will include some way of viewing the disassembly of the
program being debugged. Often, the same disassembly tool will be packaged as a
standalone disassembler distributed along with the debugger. For example,
objdump, part of GNU Binutils, is related to the interactive debugger gdb.

• IDA is the Interactive Disassembler, described in more detail below.
• ILDASM is a tool contained in the .NET Framework SDK. It can be used to
disassemble PE files containing Common Intermediate Language code.
• OllyDbg is a 32-bit assembler-level analysing debugger.
• PVDasm is a free, interactive, multi-CPU disassembler.
• SIMON is a test/debugger/animator with an integrated disassembler for
Assembler, COBOL and PL/1.
• Texe is a free, 32-bit disassembler and Windows PE file analyzer.
• unPIC is a disassembler for PIC microcontrollers.

CODE ANALYSIS

A compiled program is saved into an executable file. There are several different
formats of executable files, and some of them are usable only on certain operating
systems. Generally, every executable file contains several sections. Some sections
contain instructions, some contain data, constant data, etc.
In a disassembler it is important to distinguish two types of sections. The first
type is the data section; the second is the executable section, which contains
instructions for the processor. A data section is disassembled into a simple output
of its content, which can be either in hexadecimal text format or in binary format.
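
As a minimal sketch of the hexadecimal text output of a data section, the
function below prints 16 bytes per line, each line prefixed with the address of
its first byte. The function name dump_data_section(), the line layout and the
base parameter are assumptions made for this example; the paper does not
prescribe a particular layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Print `len` bytes of a data section as hexadecimal text, 16 bytes per
 * line. `base` is the assumed load address of the first byte. */
void dump_data_section(FILE *out, const uint8_t *data, size_t len,
                       uint32_t base)
{
    for (size_t i = 0; i < len; i += 16) {
        fprintf(out, "%08X:", (unsigned)(base + i));
        for (size_t j = i; j < i + 16 && j < len; j++)
            fprintf(out, " %02X", (unsigned)data[j]);
        fputc('\n', out);
    }
}
```

Calling dump_data_section(stdout, bytes, count, 0x1000) on the four bytes
DE AD BE EF would print the single line "00001000: DE AD BE EF".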

DESIGN OF DISASSEMBLER

The disassembler consists of three main parts. The first part handles access to the
sections of the input file, the second part is a symbol table, and the last one deals
with instruction decoding according to instruction sets. Design patterns [1] and
UML (Unified Modeling Language [2]) were used during the object-oriented design of
this program.

Interactive Disassembler
The Interactive Disassembler, more commonly known as simply IDA, is a
disassembler used for reverse engineering. It supports a variety of executable
formats for different processors and operating systems. It can also be used as a
debugger for Windows PE, Mac OS X Mach-O, and Linux ELF executables. A
decompiler plugin for programs compiled with a C/C++ compiler is available at
extra cost. The latest full version of IDA Pro is commercial software; an earlier and
less capable version is available for download free of charge (version 4.9 as of
May 2010) [2].

IDA performs much automatic code analysis, using cross-references between code
sections, knowledge of parameters of API calls, and other information. However,
the nature of disassembly precludes total accuracy, and a great deal of human
intervention is required; IDA has interactive functionality to aid in
improving the disassembly. A typical IDA user will begin with an automatically
generated disassembly listing and then convert sections from code to data and
vice versa, rename, annotate, and otherwise add information to the listing, until it
becomes clear what the program does.

Created as a shareware application by Ilfak Guilfanov, IDA was later sold as a
commercial product by DataRescue, a Belgian company, who improved it and sold
it under the name IDA Pro. In 2007 Guilfanov founded Hex-Rays to pursue the
development of the Hex-Rays Decompiler IDA extension. In January 2008 Hex-Rays
assumed the development and support of DataRescue's IDA Pro.

Scripting in Disassemblers

"IDC scripts" make it possible to extend the operation of the disassembler. Some
helpful scripts are provided, which can serve as the basis for user written scripts.
Most frequently scripts are used for extra modification of the generated code. For
example, external symbol tables can be loaded thereby using the function names of
the original source code. There are websites that are devoted to IDA scripts and
offer assistance with frequently arising problems.

Users have created plugins that allow other common scripting languages to be used
instead of, or in addition to, IDC. IdaRUB supports Ruby and IDAPython adds
support for Python. As of version 5.4, IDAPython (dependent on Python 2.5)
comes preinstalled with IDA Pro.

XDASM - Universal Cross Disassembler

XDASM is a powerful, MS/DOS®-based program disassembler which is used to
reconstruct or debug source-level code for various processor types. Its
table-driven structure and adaptable output format make XDASM usable with a wide
range of processors and assemblers.

New in version 3.3: more TAG file commands, substitution of assigned labels with
your own label names, your own comments on each instruction line, up to five
cross-reference lists, more disassembly options, and more processor tables.
XDASM 2.x customers can upgrade to version 3.3 for $49.

XDASM - Cross-Disassembler V3.3


• Generates assembly language source code from ROM/EPROM
• Accepts Intel hex, Motorola S and binary file formats
• Creates "Assembler-ready" code for your favorite assembler
• Uses manufacturer's assembly language mnemonics
• User configurable assembler directives
• Creates labels, comment hexdump and cross-reference lists
• Deblocks source code into subroutines
• User can substitute Label Names
• User can insert Instruction Line comments
• Full control of disassembly with TAG file
• Users may create tables for other processor types
• Maximum input file size of 64K (0-65535)
• Source code for all CPU tables provided
• Requires MS/DOS® PC with at least 640K RAM

Contents Required For Hardware:

Supports, At Least, The Following Processors


1802-1806    65816        8031-8040     89700       TMS7000
3870         6800-6809    8048-8052     COP400      TMS9900/95
4004         68HC08       8080/8085     COP800      TMS320C1x
6301/6303    68HC11       8086/8088     PIC16C5x    TMS320C2x
64180        78C1x        8096/80196    SUPER8      Z8

A Java disassembler basically converts the bytecode of a class file, i.e. the
machine code of the JVM, back into its source form, i.e. the Java program. The
disassembler is not a separate program: it consists of a few functions added to
our main program, which reads the class file. This is therefore the right time to
understand what these functions do.

Limitations
It only displays variables and their initializations, arithmetic operations on these
variables, strings, for loops, if statements, nested if statements and for loops, and
methods with their signatures. However, you can very well incorporate other opcodes
in the methods so that they too are converted into Java code. This is an attempt made
by us to help people understand how a Java disassembler can be written.

Functions

Function: main()

This is the entry point into the disassembler. It does the following.

• Opens (for reading) a file specified on the command line.

• Reads the first 2 bytes from the file to determine the .ORIG address (more
on this later).

• Outputs a .ORIG assembler directive with the address computed above.

• Until the end-of-file is reached, reads each 2-byte instruction from the file
and calls print_instruction() on it.

• Outputs a .END assembler directive.

• Closes the file.

The code for this function is pretty simple, so we provide it.
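
The steps listed above can be sketched as follows. This is not the provided
lc3dis.c: the helper bodies are placeholders written for illustration, the
high-byte-first word order and the -1 end-of-file return value are assumptions,
and the loop is placed in a hypothetical function disassemble() that takes
already opened streams. The placeholder print_instruction_to() (a made-up
variant of print_instruction() with an extra output-stream argument) simply
emits every word as .FILL.

```c
#include <stdio.h>

/* Placeholder: read one 16-bit word, assuming the high byte is stored
 * first. Returns -1 at end-of-file (an assumption of this sketch). */
int get_word_from_file(FILE *f)
{
    int hi = fgetc(f);
    int lo = fgetc(f);
    if (hi == EOF || lo == EOF)
        return -1;
    return (hi << 8) | lo;
}

/* Placeholder for the real decoder: emits every word as data. */
void print_instruction_to(FILE *out, int pc, int ir)
{
    (void)pc;  /* a full decoder would use pc for PC-relative operands */
    fprintf(out, ".FILL x%04X\n", (unsigned)ir);
}

/* Core of main(): returns 0 on success, -1 if no .ORIG word is present. */
int disassemble(FILE *in, FILE *out)
{
    int orig = get_word_from_file(in);  /* first word is the .ORIG address */
    if (orig < 0)
        return -1;
    fprintf(out, ".ORIG x%04X\n", (unsigned)orig);

    int pc = orig;
    int ir;
    while ((ir = get_word_from_file(in)) >= 0)
        print_instruction_to(out, pc++, ir);

    fprintf(out, ".END\n");
    return 0;
}
```

Under these assumptions, main() would only need to open the file named on the
command line with fopen(argv[1], "rb"), call disassemble(f, stdout), and close
the file with fclose(f).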

Function: get_zext_field(int bits, int hi_bit, int lo_bit)

This function gets the value of the bit field in integer bits beginning with bit hi_bit
and ending with bit lo_bit. The resulting value is zero-extended. For example, to
get the opcode of an instruction in ir, we would call this function as follows.

opcode = get_zext_field(ir, 15, 12)

Note that hi_bit and lo_bit are zero-based (i.e., they must be between 0 and 15).
This code is really quite tricky, so we've provided it for you. Please look at the
code and try to understand the logic.
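
The provided code is not reproduced in this paper. One possible implementation,
given here only as a sketch, shifts the field down to bit 0 and masks off
everything above it. It assumes 0 <= lo_bit <= hi_bit <= 15, as stated above.

```c
/* Return bits hi_bit..lo_bit (inclusive, zero-based) of `bits`,
 * zero-extended. Assumes 0 <= lo_bit <= hi_bit <= 15. */
int get_zext_field(int bits, int hi_bit, int lo_bit)
{
    int width = hi_bit - lo_bit + 1;          /* number of bits in the field */
    unsigned mask = (1u << width) - 1u;       /* `width` low-order one bits  */
    return (int)(((unsigned)bits >> lo_bit) & mask);
}
```

For the word 0x1261, get_zext_field(ir, 15, 12) gives 1 (the ADD opcode) and
get_zext_field(ir, 11, 9) gives 1 (register R1).
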

Function: get_sext_field(int bits, int hi_bit, int lo_bit)

This function is very similar to get_zext_field(), except that it sign extends the
resulting field. You will want to use get_zext_field() to select unsigned values like
opcodes or register fields (e.g., DR, SR1, etc.), but you will want to use
get_sext_field() to select signed immediate fields (e.g., imm5) or signed PC offset
fields (e.g., PCoffset9). This code is also tricky. We provide this code, but take a
look at it in order to understand it.
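
Again as a sketch rather than the provided code, sign extension can be done by
extracting the field as an unsigned value and then using the XOR-and-subtract
idiom on its top bit. The same range assumption (0 <= lo_bit <= hi_bit <= 15)
applies.

```c
/* Return bits hi_bit..lo_bit of `bits`, sign-extended to an int.
 * Assumes 0 <= lo_bit <= hi_bit <= 15. */
int get_sext_field(int bits, int hi_bit, int lo_bit)
{
    int width = hi_bit - lo_bit + 1;
    unsigned mask = (1u << width) - 1u;
    int value = (int)(((unsigned)bits >> lo_bit) & mask);
    int sign = 1 << (width - 1);              /* top bit of the field */
    /* Flipping the top bit and subtracting it maps values whose top bit
     * is set to the corresponding negative numbers. */
    return (value ^ sign) - sign;
}
```

For example, the 5-bit field 11111 becomes -1 and the 9-bit field 1 1110 1111
becomes -17.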

Function: get_bit(int bits, int bit_number)

This function is similar to get_zext_field() except that it selects and returns a single
zero-extended bit. In fact, it's implemented by calling get_zext_field() with hi_bit
and lo_bit set to the same value (bit_number). We provide this code.
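
A sketch of that arrangement follows. The body of get_zext_field() shown here
is one possible implementation, included only so that the example is complete.

```c
/* One possible get_zext_field(); assumes 0 <= lo_bit <= hi_bit <= 15. */
int get_zext_field(int bits, int hi_bit, int lo_bit)
{
    unsigned mask = (1u << (hi_bit - lo_bit + 1)) - 1u;
    return (int)(((unsigned)bits >> lo_bit) & mask);
}

/* Return the single bit `bit_number` of `bits` as 0 or 1. */
int get_bit(int bits, int bit_number)
{
    return get_zext_field(bits, bit_number, bit_number);
}
```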

Function: get_word_from_file(FILE* f)

This function extracts the next 16-bit word from the input file. We provide this
code.
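
A minimal sketch of such a function is shown below. Two details are assumptions
of this sketch, since the paper does not state them: the high-order byte of each
word is taken to come first in the file, and -1 is returned when no complete
word is left.

```c
#include <stdio.h>

/* Read the next 16-bit word from `f`.
 * Assumption: the high-order byte is stored first.
 * Returns a value in 0..0xFFFF, or -1 if fewer than two bytes remain. */
int get_word_from_file(FILE *f)
{
    int hi = fgetc(f);
    if (hi == EOF)
        return -1;
    int lo = fgetc(f);
    if (lo == EOF)
        return -1;
    return (hi << 8) | lo;
}
```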

Function: print_instruction(int pc, int ir)

This is the core of the disassembler. This function is passed two things: (i) an
integer (pc) that may have a value from 0x0000 to 0xffff, representing an address in
the LC-3 machine and (ii) an integer (ir) that may have a value from 0x0000 to
0xffff, representing an LC-3 instruction. The instruction ir is located in memory at
address pc (the pc value is useful for computing pc-relative addresses). This
function calls get_zext_field() to extract the opcode from the instruction. It then
switches on that opcode. Within the switch there is a case for each opcode (e.g.,
ADD, AND, BR, JMP, etc.). Each case examines additional instruction bits
(determined by the opcode) and prints an appropriate string representing the
instruction.

For example, in the case for the ADD instruction, we must call
get_zext_field(ir,11,9) to get the destination register and get_zext_field(ir,8,6) to
get the first source operand register. Next we must examine bit 5 (via get_bit(ir,5))
in order to determine whether the final operand is an immediate or a register. If bit 5
is 0 (i.e., register operand), we call get_zext_field(ir,4,3) and we check that the
result is 0 (i.e., bits 4 and 3 are 0). If bits 4 and 3 are not 0, this is not a legal ADD
instruction, so we call print_fill(ir) to generate a .FILL assembler directive for this
word. Otherwise, we use get_zext_field(ir,2,0) to get the second source operand
register. Finally, the ADD assembly instruction is printed via printf(). If bit 5 is 1,
we use get_sext_field(ir,4,0) to get the imm5 field, and we print the ADD
instruction. Some of this code is provided to get you started.
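
The ADD case described above can be sketched as a small helper. The name
format_add() and the idea of writing into a caller-supplied buffer are
assumptions of this sketch (the real print_instruction() prints via printf() and
calls print_fill() itself); the two field helpers are one possible
implementation, repeated so that the example is complete.

```c
#include <stddef.h>
#include <stdio.h>

/* Possible field helpers; both assume 0 <= lo_bit <= hi_bit <= 15. */
int get_zext_field(int bits, int hi_bit, int lo_bit)
{
    unsigned mask = (1u << (hi_bit - lo_bit + 1)) - 1u;
    return (int)(((unsigned)bits >> lo_bit) & mask);
}

int get_sext_field(int bits, int hi_bit, int lo_bit)
{
    int sign = 1 << (hi_bit - lo_bit);
    return (get_zext_field(bits, hi_bit, lo_bit) ^ sign) - sign;
}

/* Hypothetical helper for the ADD case. Writes the assembly text into
 * `buf` and returns 1, or returns 0 when the word is not a legal ADD
 * (the caller would then call print_fill(ir)). */
int format_add(char *buf, size_t n, int ir)
{
    int dr  = get_zext_field(ir, 11, 9);
    int sr1 = get_zext_field(ir, 8, 6);

    if (get_zext_field(ir, 5, 5) == 0) {       /* register operand */
        if (get_zext_field(ir, 4, 3) != 0)
            return 0;                          /* fixed bits 4..3 not zero */
        snprintf(buf, n, "ADD R%d, R%d, R%d",
                 dr, sr1, get_zext_field(ir, 2, 0));
    } else {                                   /* immediate operand */
        snprintf(buf, n, "ADD R%d, R%d, #%d",
                 dr, sr1, get_sext_field(ir, 4, 0));
    }
    return 1;
}
```

The word 0x1042 is formatted as ADD R0, R1, R2 and 0x127F as ADD R1, R1, #-1,
while 0x1052 is rejected because bit 4 is set in register mode.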

Function: print_fill(int ir)

This function prints a .FILL assembler directive.
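
A minimal sketch is given below. The exact output format is an assumption: the
value is written as four hexadecimal digits with the x prefix that the LC-3
assembler uses for hexadecimal numbers. The helper format_fill() is a
hypothetical addition that keeps the formatting separate from the printing.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: format a .FILL directive for the 16-bit word `ir`. */
void format_fill(char *buf, size_t n, int ir)
{
    snprintf(buf, n, ".FILL x%04X", (unsigned)ir & 0xFFFFu);
}

/* Print the directive on its own line. */
void print_fill(int ir)
{
    char line[16];
    format_fill(line, sizeof line, ir);
    printf("%s\n", line);
}
```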

How It Works

Getting started. Begin by creating a directory to work in and copying the files we
provide.

cd ~

mkdir cse240hw8

cd cse240hw8

cp ~cse240/project/hw/hw8/* .

This will give you a bunch of .asm files to use in testing (below). Also, it will give
you a file called lc3dis.c to use as a starting point. You'll want to update your path
just as you did in homework 6 (and 7). This will allow you to access lc3as for
testing.

Output. Note that the output of your disassembler is not another file. It simply
prints the disassembled instructions to the display. If you want to redirect the
output to a file use > as follows.

./lc3dis foo.obj > newfoo.asm


Resources. Appendix A and the table on the inside back cover of your textbook
will be extremely useful! You will find all the answers there!

Immediate fields. Please output all of your immediate fields in decimal (rather
than hexadecimal). This is necessary so our automatic testing scripts will not get
confused. For example, the following is fine.

LDR R1, R2, #10

While the following is equivalent to the above, it will not be accepted by our
testing scripts.

LDR R1, R2, xA

Make sure you check the fixed fields in instructions. For example, in an ADD
instruction with a register operand (bit 5 is 0), bits 4 and 3 must be 0. If they are
not, it is not an ADD instruction at all. It's not any instruction, so it must be data.
Similarly, in a JMP instruction, bits 5 to 0 must be 0. And in a NOT instruction,
bits 5 to 0 must be 1. If you discover that you are looking at data (not an
instruction), call print_fill() to generate a .FILL assembler directive.

One or more of the n, z, or p fields in a BR instruction must be set. If none of them
are, it is not a BR instruction (i.e., it must be data and print_fill() should be called).
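
The fixed-field checks mentioned here can be collected in one place, as in the
sketch below. The function name has_legal_fixed_fields() is hypothetical, the
opcode numbers (BR = 0, ADD = 1, NOT = 9, JMP = 12) come from the LC-3
instruction set rather than from this paper, and only the four checks named in
the text are covered.

```c
/* One possible get_zext_field(); assumes 0 <= lo_bit <= hi_bit <= 15. */
int get_zext_field(int bits, int hi_bit, int lo_bit)
{
    unsigned mask = (1u << (hi_bit - lo_bit + 1)) - 1u;
    return (int)(((unsigned)bits >> lo_bit) & mask);
}

/* Return 1 if the fixed fields of `ir` are legal for its opcode, or 0 if
 * the word must be treated as data (i.e. print_fill() should be called). */
int has_legal_fixed_fields(int ir)
{
    switch (get_zext_field(ir, 15, 12)) {
    case 0x0:  /* BR: at least one of n, z, p (bits 11..9) must be set */
        return get_zext_field(ir, 11, 9) != 0;
    case 0x1:  /* ADD: in register mode (bit 5 == 0) bits 4..3 must be 0 */
        return get_zext_field(ir, 5, 5) == 1 ||
               get_zext_field(ir, 4, 3) == 0;
    case 0x9:  /* NOT: bits 5..0 must all be 1 */
        return get_zext_field(ir, 5, 0) == 0x3F;
    case 0xC:  /* JMP: bits 5..0 must all be 0 */
        return get_zext_field(ir, 5, 0) == 0;
    default:   /* other opcodes are not checked in this sketch */
        return 1;
    }
}
```

For example, the all-zero word 0x0000 has the BR opcode but none of n, z, or p
set, so it is reported as data.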

PC-relative offsets. Do not try to generate assembly code that contains labels!
This would make things much harder. Instead, simply specify your PC-relative
offsets directly (in base 10, so you can specify negatives). For example, if the PC-
offset of some LD instruction is -17 (and the destination register is R1), you would
generate the following assembly instruction.

LD R1, #-17
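
As a sketch of how such a line could be produced, the hypothetical helper
format_ld() below extracts the destination register from bits 11 to 9 and the
signed PCoffset9 from bits 8 to 0, then writes the offset in base 10. The helper
name, the buffer interface and the bodies of the two field functions are
assumptions of this example.

```c
#include <stddef.h>
#include <stdio.h>

/* Possible field helpers; both assume 0 <= lo_bit <= hi_bit <= 15. */
int get_zext_field(int bits, int hi_bit, int lo_bit)
{
    unsigned mask = (1u << (hi_bit - lo_bit + 1)) - 1u;
    return (int)(((unsigned)bits >> lo_bit) & mask);
}

int get_sext_field(int bits, int hi_bit, int lo_bit)
{
    int sign = 1 << (hi_bit - lo_bit);
    return (get_zext_field(bits, hi_bit, lo_bit) ^ sign) - sign;
}

/* Hypothetical helper: format an LD instruction with its PC-relative
 * offset written directly in decimal, without any label. */
void format_ld(char *buf, size_t n, int ir)
{
    int dr = get_zext_field(ir, 11, 9);
    int offset = get_sext_field(ir, 8, 0);    /* PCoffset9, signed */
    snprintf(buf, n, "LD R%d, #%d", dr, offset);
}
```

The word 0x23EF has destination register 1 and offset field 1 1110 1111, so it
is formatted as LD R1, #-17.
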

Compiling. Use gcc on the Moore 100 machines or eniac-l.seas.upenn.edu to build
your code. You may want to use the -o flag to specify the name of the generated
program. Here's an example.

gcc -o lc3dis lc3dis.c

./lc3dis foo.obj > newfoo.asm

Object file format. For the curious, we'll describe the .obj file format. The first 2
bytes contain the .ORIG address of the program. Subsequent byte pairs (16 bits)
encode each instruction in the program.

Testing
We will provide a number of .asm files you can use to test your disassembler (but
you should also generate your own test cases). First, assemble each of these files
with the as command in the simulator. Then disassemble them with your lc3dis,
saving the output of lc3dis in a file via redirection.

as t1.asm (in simulator)

./lc3dis t1.obj > newt1.asm

Now, in order to confirm that your code is correct, use the Unix diff utility to
compare t1.asm and newt1.asm.

diff -w -i t1.asm newt1.asm

If diff produces no output, the two files are the same. Note that -w instructs diff to
ignore whitespace and -i instructs it to ignore case. If the files are different, diff will
indicate how they are different (type "man diff" for more information on diff).

Note that if your original .asm file contained labels, these naturally won't appear in
the corresponding disassembled code. You'll have to confirm that the addresses
your disassembler generates are correct.

Also note that the output of the disassembler cannot be directly assembled because
the assembler doesn't know what to do with absolute addresses (it wants labels).

Features
• Symbol dictionaries of the ROM names and global symbols (0 - $B00) along with
value to symbol substitution in appropriate places.

• Selective list of procedures in a file by procedure name or substring.

• Ability to place the disassembled output in a file in assembler listing output or
assembler input format (MDS “.list” or “.asm” format).

• References to the symbols are collected and may be selectively viewed.

• Ability to search the program file for references to selected addresses, trap (ROM)
calls, resource type references, constant or string references.

• Ability to translate the segment relative address of an instruction to the disk file
relative address for code patching purposes.

• MacNosy records its input on a “.jrnl” file (in text format) for later playback. This
feature is used as an educational tool and as a medium of communication between
developers, hackers, etc.

• Ability to reformat data in its “natural format” via directives. This is in addition
to the automatic recognition of various character string formats.

• A full or selective listing of the resources in a file. Format is similar to that of the
Resource Mover, but you get more information with less work.

• A built-in mini editor to view files without leaving Nosy.

Facts and Specifications


It is capable of disassembling the resource fork of any application file, ROM,
MacsBug, and various resource types in the System file (DRVR, PACK, INIT, CDEF,
WDEF, etc). Note that source listings of the WDEF and CDEF procedures come with
the MacSupplement.

References
• www.google.com
• www.wikipedia.com
• www.debian-administrator.org/articles/492--disassemblers
• http://www.turboexplorer.com/cpp