Documente Academic
Documente Profesional
Documente Cultură
not Benchmarks
HiPEAC Computing Systems Week 2012
Ali Saidi, Andreas Hansson
Welcome
Glad youre here!
gem5 has been a multi-year effort
ARM is giving this tutorial today
Bit ARM-ISA focused
Borrowing material from previous gem5 tutorials
Many institutions and companies have contributed to the simulator
Encourage you to do the same
Outline Part 1
Introduction
Compiling
Running
Checkpoints
Sampling
Instrumenting
Results
Trace
Debugging the simulator
Debugging the execution
Basics
Debugging
Outline Part 2
Memory System
Overview
Ports
Transport interfaces
Caches and Coherence
Interconnect components
CPU Models
Simple
InOrder
Out-of-order
Common Tasks
Adding a statistic, SimObject, or Instruction
Conclusion
5
INTRODUCTION
System Simulator
Built from a combination of M5 and GEMS
In doing so we lost all capitalization: gem5
Flexibility
Programmer
View
Huge amounts
of interesting
gem5
work
can be
done in this
area
Profilers &
Dynamic
Instrumentation
Accuracy
Validation
Model
RTL
Envisioned use-cases
SW development and verification
Binary-translation models (e.g. OVP/QEMU) are fast enough to do
this and have a mature SW development environment
10
11
Android Gingerbread
Real Applications
12
Graphical Statistics
13
Memory
L2
L2
Disk
Disk
I/O
I/O
Net
Client
14
L1
L1
Simulated Ethernet
Net
Server
Memory
L1
CPU
Main Goals
Open source tool focused on architectural modeling
Flexibility
Availability
For both academic and corporate researchers
No dependence on proprietary code
BSD license
Collaboration
15
High-level Features
Configurable CPU models
Device Models
Linux, Android
Many ISAs
16
17
18
What is new?
If you havent looked at gem5 recently
ARM & x86 support
Re-worked memory system with TLM-like semantics
Integration with GEMS
SE/FS merged together
Frame buffers and VNC
19
BASICS
20
Building gem5
Platforms
Linux, BSD, MacOS X, Solaris, etc
64 bit machines help quite a bit
Tools
GCC/G++ 4.2+ (or clang 2.9+)
Python 2.4+
SCons 0.98.1+
http://www.scons.org
SWIG 1.3.40+
http://www.swig.org
swig zlib-dev
Compile Targets
build/<isa>/<binary>
ISAs:
ARM, ALPHA, MIPS, SPARC, POWER, X86
Binaries
22
gem5.debug
gem5.opt
gem5.fast
gem5.prof
Sample Compile
21:36:01 [/work/gem5] scons build/ARM/gem5.opt j4
scons: Reading SConscript files ...
Checking for leading underscore in global variables...(cached) yes
Checking for C header file Python.h... (cached) yes
Checking for C library dl... (cached) yes
Checking for C library python2.7... (cached) yes
Checking for accept(0,0,0) in C++ library None... (cached) yes
Checking for zlibVersion() in C++ library z... (cached) yes
Checking for clock_nanosleep(0,0,NULL,NULL) in C library None... (cached) no
Checking for clock_nanosleep(0,0,NULL,NULL) in C library rt... (cached) no
Can't find library for POSIX clocks.
Checking for C header file fenv.h... (cached) yes
Reading SConsopts
Building in /work/gem5/build/ARM
Using saved variables file /work/gem5/build/variables/ARM
Generating LALR tables
WARNING: 1 shift/reduce conflict
scons: done reading SConscript files.
scons: Building targets ...
[ CXX] ARM/sim/main.cc -> .o
[ TRACING] -> ARM/debug/Faults.hh
[GENERATE] -> ARM/arch/interrupts.hh
[GENERATE] -> ARM/arch/isa_traits.hh
[GENERATE] -> ARM/arch/microcode_rom.hh
[ CFG ISA] -> ARM/config/the_isa.hh
23
Running Simulation
21:58:32 [ /work/gem5] ./build/ARM/gem5.opt -h
Usage
=====
gem5.opt [gem5 options] script.py [script options]
gem5 is copyrighted software; use the --copyright option for details.
Options
=======
--version
show program's version number and exit
--help, -h
show this help message and exit
--build-info, -B
Show build information
--copyright, -C
Show full copyright information
--readme, -R
Show the readme
--outdir=DIR, -d DIR
Set the output directory to DIR [Default: m5out]
--redirect-stdout, -r
Redirect stdout (& stderr, without -e) to file
--redirect-stderr, -e
Redirect stderr to file
--stdout-file=FILE
Filename for -r redirection [Default: simout]
--stderr-file=FILE
Filename for -e redirection [Default: simerr]
--interactive, -i
Invoke the interactive interpreter after running the
script
--pdb
Invoke the python debugger before running the script
--path=PATH[:PATH], -p PATH[:PATH]
Prepend PATH to the system path when invoking the
script
24
Running Simulation
Statistics Options
------------------
--stats-file=FILE
Sets the output file for statistics [Default: stats.txt]
Configuration Options
---------------------
--dump-config=FILE
Dump configuration output file [Default: config.ini]
--json-config=FILE
Create JSON output of the configuration [Default: config.json]
Debugging Options
-----------------
--debug-break=TIME[,TIME]
Cycle to create a breakpoint
--debug-help
Print help on trace flags
--debug-flags=FLAG[,FLAG]
Sets the flags for tracing (-FLAG disables a flag)
--remote-gdb-port=REMOTE_GDB_PORT
Remote gdb base port (set to 0 to disable listening)
Trace Options
-------------
--trace-start=TIME
Start tracing at TIME (must be in ticks)
--trace-file=FILE
Sets the output file for tracing [Default: cout]
--trace-ignore=EXPR
Ignore EXPR sim objects
25
26
27
Actual simulation
29
Library of simulation
objects described in Python
Objects
Everything you care about is an object (C++/Python)
Assembled using Python, simulated using C++
Derived from SimObject base class
Common code for creation, configuration parameters, naming,
checkpointing, etc.
Events
Standard discrete-event timing model
Global logical time in ticks
No fixed relation to real time
Constants in src/sim/core.hh always relate ticks to real time
Ports
Used for connecting MemObjects together
MasterPort SlavePort
inst.
instr.
memory
data
data
memory
CPU
32
Transport interfaces
Three transport interfaces: Functional, Atomic, Timing
All have their own transport functions on the ports
sendFunctional(), sendAtomic(), sendTiming()
Functional:
Used for loading binaries, debugging, introspection, etc.
Accesses happen instantaneously
Reads get the newest copy of the data
Writes update data everywhere in the memory system
Completes a transaction in a single function call
Requests complete before sendFunctional() returns
Equivalent to TLM-2 debug transport
Objects that buffer packets must be queried and updated as well
33
Timing:
Statistics
Wide variety of statistics available
Scalar
Average
Vector
Formula
Histogram
Distribution
Vector Distribution
35
Constraints
Original simulation and test simulations must have
Same ISA; number of cores; memory map
We dont currently checkpoint cache state
Checkpoints should be created with Atomic CPU and no caches
36
RUNNING AN EXPERIMENT
37
38
39
40
Statistics Output
[/work/gem5] cat m5out/stats.txt
---------- Begin Simulation Statistics ----------
sim_seconds
0.002038
sim_ticks
2038122000
final_tick
2038122000
sim_freq
1000000000000
host_inst_rate
2581679
host_op_rate
2781442
system.physmem.bytes_read
17774713
system.physmem.bytes_written
656551
system.cpu.numCycles
4076245
system.cpu.committedInsts
2763927
system.cpu.committedOps
2977829
41
42
Stats Output
[/work/gem5] cat m5out/stats.txt
---------- Begin Simulation Statistics ----------
sim_seconds
0.001687
sim_ticks
1686872500
final_tick
1686872500
sim_freq
1000000000000
host_inst_rate
103418
host_op_rate
111421
system.physmem.bytes_read
43968
system.physmem.bytes_written
0
system.cpu.numCycles
4076245
system.cpu.committedInsts
2763927
system.cpu.committedOps
2977829
system.cpu.commit.branchMispredicts
93499
system.cpu.cpi
1.220635
43
Recompile the binary when done:
44
Create a Checkpoint
Command Line:
[/work/gem5] ./build/ARM/gem5.opt configs/example/se.py -c queens o 16
gem5 Simulator System. http://gem5.org
gem5 is copyrighted software; use the --copyright option for details.
**** REAL SIMULATION ****
info: Entering event queue @ 0. Starting simulation...
Writing checkpoint
info: Entering event queue @ 6805000. Starting simulation...
Exiting @ tick 2038122000because target called exit()
Directory:
[/work/gem5] ls m5out
config.ini config.json cpt.6805000 stats.txt
45
---------- End Simulation Statistics ----------
46
}
}
47
Mount command:
[/work/gem5] mount o loop,offset=32256 linux-arm-ael.img /mnt
[/work/gem5] ls /mnt
bin boot dev etc home lib lost+found media mnt proc root sbin sys tmp usr var writable
[/work/gem5] cp queens /mnt
[/work/gem5] cp queens-w-chkpt /mnt
configs/boot/queens.rcS:
#!/bin/sh
# Wait for system to calm down
sleep 10
# Take a checkpoint in 100000 ns
m5 checkpoint 100000
# Reset the stats
m5 resetstats
# Run queuens
/queens 16
# Exit the simulation
m5 exit
49
gem5 Terminal
Default output from full-system simulation is on a UART
m5term is a terminal emulator that lets you connect to it
Code is in src/util/term
Run make in that directory and make install
Binary takes two parameters
./m5term <host> <port>
If youre running it locally, use the loopback interface
127.0.0.1
Port number is printed when gem5 starts
Tries 3456 and increments until it find a free port
So if youre running multiple copies on a single machine you might
find 3457, 3458,
50
51
parameters
config.json json formatted file which is easy to parse for input into
other simulators (e.g. power)
Statistics
stats.txt Youve seen several examples of this
Checkpoints
cpt.<cycle number> -- Each checkpoint has a cycle number. The r N
Output
*.terminal Serial port output from the simulation
frames_<system> Framebuffer output
53
DEBUGGING
54
Debugging Facilities
Tracing
Instruction tracing
Diffing traces
Pipeline viewer
55
Tracing/Debugging
printf() is a nice debugging tool
Example flags:
Fetch, Decode, Ethernet, Exec, TLB, DMA, Bus, Cache, O3CPUAll
Print out all flags with debug-help
--debug-flags=Exec
--trace-start=30000
--trace-file=my_trace.out
Enable the flag Exec; start at tick 30000; Write to my_trace.out
56
my_trace.out:
2:44:47 [ /work/gem5] head m5out/my_trace.out
50000:
system.cpu:
Decode:
Decoded cmps instruction:
50500:
system.cpu:
Decode:
Decoded ldr instruction:
51000:
system.cpu:
Decode:
Decoded ldr instruction:
51500:
system.cpu:
Decode:
Decoded ldr instruction:
52000:
system.cpu:
Decode:
Decoded addi_uop instruction:
52500:
system.cpu:
Decode:
Decoded cmps instruction:
53000:
system.cpu:
Decode:
Decoded b instruction:
53500:
system.cpu:
Decode:
Decoded sub instruction:
54000:
system.cpu:
Decode:
Decoded cmps instruction:
54500:
system.cpu:
Decode:
Decoded ldr instruction:
57
0xe353001e
0x979ff103
0xe5107004
0xe4903008
0xe4903008
0xe3530000
0x1affff84
0xe2433003
0xe353001e
0x979ff103
MyNewFlag.hh
Instruction Tracing
Separate from the general debug/trace facility
But both are enabled the same way
Per-instruction records populated as instruction executes
Start with PC and mnemonic
Add argument and result values as they become known
Printed to trace when instruction completes
Flags for printing cycle, symbolic addresses, etc.
2:44:47 [ /work/gem5] head m5out/my_trace.out
50000:
T0 : 0x14468
: cmps r3, #30
: IntAlu : D=0x00000000
50500:
T0 : 0x1446c
: ldrls pc, [pc, r3 LSL #2]
: MemRead : D=0x00014640 A=0x14480
51000:
T0 : 0x14640
: ldr r7, [r0, #-4]
: MemRead : D=0x00001000 A=0xbeffff0c
51500:
T0 : 0x14644.0
: ldr r3, [r0] #8
: MemRead : D=0x00000011 A=0xbeffff10
52000:
T0 : 0x14644.1
: addi_uop r0, r0, #8
: IntAlu : D=0xbeffff18
52500:
T0 : 0x14648
: cmps r3, #0
: IntAlu : D=0x00000001
53000:
T0 : 0x1464c
: bne
: IntAlu :
59
60
Diffing Traces
Often useful to compare traces from two simulations
but you really dont want to run the simulation to completion first
util/rundiff
util/tracediff
63
State trace
64
cycle-by-cycle to gem5
Supports ARM, x86, SPARC
See wiki for more information
Checker CPU
Runs a complex CPU model such as the O3 model in tandem
with a special Atomic CPU model
65
Remote Debugging
./build/ARM/gem5.opt configs/example/fs.py
gem5 Simulator System
...
command line: ./build/ARM/gem5.opt configs/example/fs.py
Global frequency set at 1000000000000 ticks per second
info: kernel located at: /dist/binaries/vmlinux.arm
Listening for system connection on port 5900
Listening for system connection on port 3456
0: system.remote_gdb.listener: listening for remote gdb #0 on port 7000
info: Entering event queue @ 0. Starting simulation...
66
Remote Debugging
GNU gdb (Sourcery G++ Lite 2010.09-50) 7.2.50.20100908-cvs Copyright (C)
2010 Free Software Foundation, Inc."
..."
(gdb) symbol-file /dist/binaries/vmlinux.arm
Reading symbols from /dist/binaries/vmlinux.arm...done.
(gdb) set remote Z-packet on"
(gdb) set tdesc filename arm-with-neon.xml"
(gdb) target remote 127.0.0.1:7000
Remote debugging using 127.0.0.1:7000"
cache_init_objs (cachep=0xc7c00240, flags=3351249472) at mm/slab.c:2658
(gdb) step"
sighand_ctor (data=0xc7ead060) at kernel/fork.c:1467"
(gdb) info registers
r0 0xc7ead060
-940912544
r1 0x5201312
r2 0xc002f1e4
-1073548828
r3 0xc7ead060
-940912544
r4 0x00
r5 0xc7ead020
-940912608
67
Python Debugging
It is possible to drop into the python interpreter (-i flag)
This currently happens after the script file is run
If you want to do this before objects are instantiated, remove
them from script
It is possible to drop into the python debugger (--pdb flag)
Occurs just before your script is invoked
Lets you use the debugger to debug your script code
Code that enables this stuff is in src/python/m5/main.py
At the bottom of the main function
Can copy the mechanism directly into your scripts, if in the wrong
68
O3 Pipeline Viewer
Use --debug-flags=O3PipeView and util/o3-pipeview.py
69
MEMORY SYSTEM
70
Goals
Model a system with heterogeneous applications, running on
a set of heterogeneous processing engines, using
heterogeneous memories and interconnect
Processor
Processor
Processor
CPU
Video
backend
Video
decoder
GPU
GPU
GPU
GPU
Interconnect
Interconnect
DRAM
DRAM
DRAM
71
3D-DRAM
SRAM
DMA
NAND
NAND
PCM
STT-RAM
Goals, contd.
Two worlds...
Computation-centric simulation
e.g. SimpleScalar, Simics, Asim etc
More behaviourally oriented, with ad-hoc ways of describing
Slave module
Interconnect
module
I$
CPU
memory0
bus
memory1
D$
Master port
73
Slave port
CPU
...
delete resp_pkt;
74
memory
if (req_pkt->needsResponse()) {
req_pkt->makeResponse();
} else {
delete req_pkt;
}
...
transaction
Virtual/physical addresses, size
MasterID uniquely identifying the module behind the request
Stats/debug info: PC, CPU, and thread ID
Requests are transported as Packets
Command (ReadReq, WriteReq, ReadResp, etc.) (MemCmd)
Address/size (may differ from request, e.g., block aligned cache miss)
Pointer to request and pointer to data (if any)
Source & destination port identifiers (relative to interconnect)
Used for routing responses back to the master
Always follow the same path
SenderState opaque pointer
Enables adding arbitrary information along packet path
75
masterPort.sendFunctional(pkt);
// packet is now a response
MySlavePort::recvFunctional(PacketPtr pkt)
{
...
CPU
76
memory
For a slave module, perform any state updates and turn the request into a
response
For an interconnect module, perform any state updates and forward the
request through the appropriate master port using sendAtomic
CPU
77
MySlavePort::recvAtomic(PacketPtr pkt)
{
...
return latency;
}
memory
overloading recvTiming
Perform state updates and potentially forward request packet
For a slave module, typically schedule an action to send a response at a later time
A slave port can choose not to accept a request packet by returning false
The slave port later has to call sendRetry to alert the master port to try again
bool success = masterPort.sendTiming(pkt);
if (success) {
// request packet is sent
...
} else {
// failed, will get
// retry from slave port
...
}
CPU
78
MySlavePort::recvTiming(PacketPtr pkt)
{
assert(pkt->isRequest());
...
return true/false;
}
memory
overloading recvTiming
Perform state updates and potentially forward response packet
For a master module, typically schedule a succeeding request
A master port can choose not to accept a response packet by returning
false
The master port later has to call sendRetry to alert the slave port to try again
MyMasterPort::recvTiming(PacketPtr pkt)
{
assert(pkt->isResponse());
...
return true/false;
}
CPU
79
memory
Detailed statistics
e.g., Request size/type distribution, state transition frequencies, etc...
Detailed component simulation
Network (fixed/flexible pipeline and simple)
Caches (Pluggable replacement policies)
Runs with Alpha and X86
Limited support for functional accesses
80
Caches
Single cache model with several components:
Cache: request processing, miss handling, coherence
Tags: data storage and replacement (LRU, IIC, etc.)
Prefetcher: N-Block Ahead, Tagged Prefetching, Stride Prefetching
MSHR & MSHRQueue: track pending/outstanding requests
Also used for write buffer
Parameters: size, hit latency, block size, associativity, number of
MSHRs (max outstanding requests)
81
Coherence protocol
MOESI bus-based snooping protocol
Support nearly arbitrary multi-level hierarchies at the expense of
some realism
82
83
Memory
All memories in the system inherit from AbstractMemory
Encapsulates basic memory behaviour:
Has an address range with a start and size
Can perform a zero-time functional access and normal access
SimpleMemory is currently the only subclass
Multi-port memory controller
Fixed-latency memory (possibly with a variance)
Infinite throughput
84
Work in progress
4-phase handshakes like TLM-2
Begin/end request/response
Enable straight forward modeling of contention and arbitration
Add associated library of general arbiters for shared buses and
memory controllers
Communication monitor
Insert as a structural component where stats are desired
Captures a wide range of communication stats: bandwidth, latency,
inter-transaction time, outstanding transactions etc
Can be found on the review board
Traffic generators
Inject requests based on probabilistic state-transition diagrams
Black-box IP models or predictable scenarios for memory system
testing and performance validation
85
CPU MODELS
87
CPU
88
Memory
Decoder
AtomicSimpleCPU
TLB
TimingSimpleCPU
Faults
InOrder CPU
Interrupts
O3 CPU
Classic
Ruby
89
benchmark
Warming Up Caches
Studies that do not require detailed CPU modeling
90
91
92
93
stage
If an instruction cant complete all its resource requests in one stage,
it blocks the pipeline
94
ThreadContexts
Interface for accessing total architectural state of a single
thread
PC, register values, etc.
95
Instruction Decoding
Memory
Byte
Byte
Byte
Byte
Byte
Predecoder
Byte
Byte
Context
ExtMachineInst
Decoder
StaticInst
96
Macro-op
Byte
StaticInst
Represents a decoded instruction
Has classifications of the inst
Corresponds to the binary machine inst
Only has static information
97
DynInst
Dynamic version of StaticInst
Used to hold extra information detailed CPU models
BaseDynInst
Holds PC, Results, Branch Prediction Status
Interface for TLB translations
98
99
COMMON TASKS
100
Common Tasks
Adding a statistic
Parameters and SimObject
Creating an SimObject
Configuration
Initialization
Serialization
Events
Instrumenting a benchmark
101
Adding a statistic
Add a statistic to the atomic CPU model
Track number of instruction committed in user mode
Number of statistics classes in gem5
Scalar, Average, Vector, Formula, Histogram, Distribution, Vector Dist
Well choose a Scalar and a Formula
Count number of instructions in user mode
Formula to print percentage out of total
102
103
numUserInsts
.name(name() + ".committedUserInsts")
.desc("Number of instructions committed
while in use code)
;
percentUserInsts
.name(name() + .percentUserInsts")
.desc(Percent of total of instructions
committed while in use code)
;
"
idleFraction = constant(1.0) - notIdleFraction;
percentUserInsts = numUserInsts/numInsts;
104
Accumulate numUserInsts
void countInst()
{
if (!curStaticInst->isMicroop() || curStaticInst->isLastMicroop()) {
numInst++;
numInsts++;
if (TheISA::inUserMode(tc))
numUserInsts++;
}
}
105
Stats:
[/work/gem5] grep Insts m5out/stats.txt
system.cpu.committedInsts
59262896
system.cpu.committedUserInsts 6426560
system.cpu.percentUserInsts
0.108442
106
107
Parameter default
Parameter Description
class Pl011(Uart):
type = 'Pl011'
gic = Param.Gic(Parent.any, )
int_num = Param.UInt32()
end_on_eot = Param.Bool(False, "End )
int_delay = Param.Latency("100ns", "Time ")
109
Creating a SimObject
Derive Python class from Python SimObject
Recompile
110
SimObject Initialization
SimObjects go through a sequence of initialization
1. C++ object construction
Other SimObjects in the system may not be constructed yet
2. SimObject::init()
Called on every object before the first simulated cycle
Useful place to put initialization that requires other SimObjects
3. SimObject::initState()
Called on every SimObject when not restoring from a checkpoint
4. SimObject::loadState()
Called on every SimObject when restoring from a checkpoint
By default the implementation calls SimObject::unserialize()
111
Creating/Using Events
One of the most common things in an event driven simulator
is scheduling events
Declaring events and handlers is easy:
112
114
Instrumenting a Benchmark
You can add instructions that tell simulator to take action
115
CONFIGURATION
116
Simulator Configuration
Config files that come with gem5 are meant to be examples
Certainly not meant to expose every parameter you would want to
change
117
SimObject Parameters
Parameters can be
A Simple Example
import m5
from m5.objects import *
class MyCache(BaseCache):
assoc = 2
block_size = 64
latency = '1ns'
mshrs = 10
tgts_per_mshr = 5
class MyL1Cache(MyCache):
is_top_level = True
cpu = TimingSimpleCPU(cpu_id=0)
cpu.addTwoLevelCacheHierarchy(MyL1Cache(size = '128kB'),
MyL1Cache(size = '256kB'),
MyCache(size = '2MB', latency='10ns'))
system = System(cpu = cpu,
physmem =SimpleMemory(),
membus = Bus())
119
121
CONCLUSION
122
Summary
Basics of using gem5
High-level features
Running simulations
Debugging them
Memory system
CPU Models
Common Tasks
Adding a statistics
SimObject Parameters
Creating a SimObject
Instrumenting a Benchmark
123
Keep in Touch
Please check out the website:
We hope you found this tutorial and will find gem5 useful
Wed love to work with you to make gem5 more useful to the
community
Thank you
124