Zen Architecture

A Seminar Report
Submitted to the Faculty of Engineering and Technology.

Bachelor of Technology, Computer Science and Engineering
V Semester
(Autonomous Batch)
by
K.Jeevan Reddy
B17CS110
Under the Guidance of

B.Raju
Assistant Professor, Dept of C.S.E
DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING

KAKATIYA INSTITUTE OF TECHNOLOGY & SCIENCE, WARANGAL
(An Autonomous Institute under Kakatiya University, Warangal, Telangana)
Academic Year: 2019-2020
CERTIFICATE
This is to certify that K.Jeevan Reddy, bearing Roll No. B17CS110, of the V semester
B.Tech. Computer Science and Engineering (Autonomous), has satisfactorily completed the
seminar entitled “Zen Architecture” in partial fulfillment of the requirements for
qualifying the seminar work course during the academic year 2019-2020.
Counselor: B.Raju
Seminar Coordinator: S.Venkatramulu, Dept of C.S.E
Head of the Department: V.Shankar, Dept of C.S.E
ACKNOWLEDGEMENT
I extend my sincere and heartfelt thanks to my esteemed counselor B.Raju,
Assistant Professor, for his exemplary guidance, monitoring and constant
encouragement throughout the course, especially at crucial junctures, and for showing me
the right way.
I am grateful to the respected coordinator S. Venkatramulu, Associate Professor, for
permitting me to utilize all the necessary facilities of the Institute.
I would like to extend thanks to our respected Head of the Department, Dr. V.Shankar,
Professor, for allowing me to use the facilities available. I would also like to thank the
other faculty members.
Last but not least, I would like to thank my friends and family for the support
and encouragement they have given me during the course of this work.
K.Jeevan Reddy
B17CS110
ABSTRACT
Zen is the codename for a computer processor microarchitecture from AMD, first used in
the Ryzen series of CPUs in February 2017. The first Zen-based preview system was
demonstrated at E3 2016, and the design was first substantially detailed at an event
hosted a block away from the Intel Developer Forum 2016. The first Zen-based CPUs,
codenamed "Summit Ridge", reached the market in early March 2017; Zen-derived Epyc
server processors launched in June 2017, and Zen-based APUs arrived in November 2017.
Zen is a clean-sheet design that differs from the long-standing Bulldozer architecture.
Zen-based processors use a 14 nm FinFET process, are reportedly more energy efficient,
and can execute significantly more instructions per cycle. Simultaneous multithreading
(SMT) has been introduced, allowing each core to run two threads. The cache system has
also been redesigned, making the L1 cache write-back. Zen processors use three different
sockets: desktop and mobile Ryzen chips use the AM4 socket, bringing DDR4 support; the
high-end desktop Zen-based Threadripper chips use the TR4 socket, support quad-channel
DDR4 RAM and offer 64 PCIe 3.0 lanes (vs. 24 lanes on AM4); and Epyc server processors
use the SP3 socket, offering 128 PCIe 3.0 lanes and octal-channel DDR4. However, not all
Socket AM4 CPUs are based on the Zen microarchitecture: the 7th-generation APUs and
Athlon X4s are based on the Excavator microarchitecture.
Contents
● What is Ryzen?
● History
● Features
● Zen Architecture
● SenseMI Technology
● Master Software
● Benchmarks
The Ryzen Chip
What is Ryzen?
Ryzen is a family of CPU chips released by AMD in 2017 that uses AMD's latest architecture, called Zen.
AMD has released the Ryzen 7 and Ryzen 5 families described below, and plans to release another
one called Ryzen 3.
What it is used for:

Ryzen is a general-purpose processor family with different tiers: Ryzen 7 and Ryzen 5 are currently
on the market, and Ryzen 3 is coming to market later in the year. The three tiers are priced from
high to low respectively. Ryzen 7 is aimed at people looking for high CPU performance for tasks like
high-spec gaming and video and photo editing. Ryzen 5 is aimed at mainstream users who want a
regular computer for work tasks and web browsing. Ryzen 3 will probably be a bare-bones CPU that
runs an OS effectively.
Some specifics about the chip itself:

It uses a 14 nm process and is manufactured by GlobalFoundries. The design goal was to hit a 40%
increase in IPC (instructions per cycle) over the previous Excavator generation; it actually
achieved a 52% increase while keeping energy use the same, which increases its efficiency
substantially.
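The IPC figures above can be put into rough numbers. The sketch below assumes that throughput is simply IPC multiplied by clock frequency and that "the same energy used" means the same power draw; the function name and the equal-clock comparison are illustrative assumptions, not AMD's methodology.

```python
def relative_throughput(ipc_gain, clock_ratio=1.0):
    """Throughput relative to the baseline core, assuming throughput = IPC x clock."""
    return (1.0 + ipc_gain) * clock_ratio


target = relative_throughput(0.40)    # design goal from the text: +40% IPC
achieved = relative_throughput(0.52)  # reported result from the text: +52% IPC

# Assumption: power draw is unchanged, so work per unit of energy scales
# by the same factor as throughput.
print(f"target:   {target:.2f}x baseline throughput")
print(f"achieved: {achieved:.2f}x baseline throughput")
print(f"achieved exceeds target by {(achieved / target - 1) * 100:.1f}%")
```

Under these assumptions, a 52% IPC gain at equal clock and equal power is also a 52% gain in performance per watt.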
History
Design work on the microarchitecture began in 2012 and was completed four years later. It
began with the hiring of Jim Keller, who had previously worked at Apple on the design of the
A4 and A5 processors. The very first products based on the brand-new CPU core design are
the Ryzen processors.
Zen Architecture Highlights
● Two threads per core (SMT - Simultaneous Multithreading)
● All-new Micro-op Cache
● Up to 20MB Unified Cache
● Two AES units for security (Advanced Encryption Standard)
● High-precision MHz Boost
● High efficiency FinFET transistors

Cache
Write back L1 cache
Faster L2 and L3 cache
7 cycles to load to FPU
Almost twice the L1 and L2 bandwidth
Up to 5x the L3 bandwidth
Pipelining
Each Ryzen core has one floating-point unit and one integer unit.
Each integer unit has six pipes: four ALUs (Arithmetic Logic Units) and two AGUs (Address
Generation Units). Through the AGUs, the load/store unit can perform two 16-byte loads and one
16-byte store per cycle via a 32 KB, 8-way set-associative, write-back L1 data cache. The
floating-point unit is capable of performing two FMAC operations or a single 256-bit AVX
operation per cycle.
The front end can decode four instructions per cycle and deliver six operations per cycle to
the schedulers.
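The cache and load/store figures above translate into a cache geometry and a peak per-core bandwidth. The following is a minimal sketch: the 64-byte line size and the 3.6 GHz core clock are assumptions chosen for illustration and are not stated in this report, and the helper names are hypothetical.

```python
L1D_SIZE_BYTES = 32 * 1024  # from the text: 32 KB L1 data cache
L1D_WAYS = 8                # from the text: 8-way set associative
LINE_BYTES = 64             # assumption: 64-byte cache lines
LOADS_PER_CYCLE = 2         # from the text: two 16-byte loads per cycle
STORES_PER_CYCLE = 1        # from the text: one 16-byte store per cycle
ACCESS_BYTES = 16
CORE_CLOCK_GHZ = 3.6        # assumption: example core clock


def cache_sets(size_bytes, ways, line_bytes):
    """Number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


def peak_bandwidth_gb_s(accesses_per_cycle, bytes_per_access, clock_ghz):
    """Peak bandwidth in GB/s: bytes per cycle times cycles per nanosecond."""
    return accesses_per_cycle * bytes_per_access * clock_ghz


sets = cache_sets(L1D_SIZE_BYTES, L1D_WAYS, LINE_BYTES)
load_bw = peak_bandwidth_gb_s(LOADS_PER_CYCLE, ACCESS_BYTES, CORE_CLOCK_GHZ)
store_bw = peak_bandwidth_gb_s(STORES_PER_CYCLE, ACCESS_BYTES, CORE_CLOCK_GHZ)

print(f"L1D sets: {sets}")
print(f"peak load bandwidth per core:  {load_bw:.1f} GB/s")
print(f"peak store bandwidth per core: {store_bw:.1f} GB/s")
```

These are upper bounds per core; sustained bandwidth depends on hit rates and on the rest of the memory hierarchy.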
Clock Domains
Zen is divided into a number of clock domains, each operating at a certain frequency:
● UClk - UMC Clock
The frequency at which the Unified Memory Controller (UMC) operates. This frequency
is identical to MemClk.
● LClk - Link Clock
The clock at which the I/O Hub Controller communicates with the chip.
● FClk - Fabric Clock
The clock at which the data fabric operates. This frequency is identical to MemClk.
● MemClk - Memory Clock
The internal and external memory clock.
● CClk - Core Clock
The frequency at which the CPU core and the caches operate (i.e. the advertised frequency).
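The relationships between the clock domains listed above can be sketched as a small calculation. The example assumes DDR memory, which transfers data twice per clock, so MemClk is half the advertised transfer rate; the function name and the example values (DDR4-2666, 3600 MHz core clock) are illustrative assumptions. LClk is left out because the text gives no relation for it.

```python
def zen_clock_domains(ddr_transfer_rate_mts, core_clock_mhz):
    """Derive the clock domains (in MHz) that the text ties to MemClk."""
    mem_clk = ddr_transfer_rate_mts / 2  # double data rate: two transfers per clock
    return {
        "MemClk": mem_clk,
        "UClk": mem_clk,         # text: identical to MemClk
        "FClk": mem_clk,         # text: identical to MemClk
        "CClk": core_clock_mhz,  # text: the advertised core frequency
    }


clocks = zen_clock_domains(ddr_transfer_rate_mts=2666, core_clock_mhz=3600)
for name, mhz in clocks.items():
    print(f"{name}: {mhz:.0f} MHz")
```

One consequence of this coupling is that faster memory also raises the data fabric clock.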
Address Generation Unit
The Address Generation Unit (AGU) described here is the one in the DSP56300 core, where it
is one of three execution units. The AGU performs the effective address calculations (using
integer arithmetic) necessary to address data operands in memory and contains the registers
used to generate the addresses. To minimize address-generation overhead, the AGU operates in
parallel with other chip resources. It implements four types of arithmetic:
● Linear
● Modulo
● Multiple wrap-around modulo
● Reverse-carry
AGU Architecture

The AGU is divided into halves, each with its own Address Arithmetic Logic Unit
(Address ALU). Each Address ALU has four sets of register triplets, and each register
triplet is composed of an address register, an offset register, and a modifier register. The
two Address ALUs are identical. Each contains a 24-bit full adder (an offset adder),
which can perform the following additions/subtractions on an address register:
● Plus one
● Minus one
● Plus the contents of the respective offset register N
● Minus the contents of the respective offset register N
A second full adder (a modulo adder) adds the summed result of the first full adder to a
modulo value, M or minus M, where M is stored in the respective modifier register. A
third full adder (a reverse-carry adder) can perform the following additions to the
selected address register, with the carry propagating in the reverse direction, that is,
from the Most Significant Bit (MSB) to the Least Significant Bit (LSB):
● Plus one
● Minus one
● Plus the offset N (stored in the respective offset register)
● Minus N
The offset adder and the reverse-carry adder operate in parallel and share common inputs.
The only difference between them is that the carry propagates in opposite directions. Test
logic determines which of the three summed results of the full adders is output. Figure 4-1
shows a block diagram of the AGU.
[Figure 4-1: AGU Block Diagram, showing the Low Address ALU (R0-R3, N0-N3, M0-M3), the High Address ALU (R4-R7, N4-N7, M4-M7), the triple multiplexer driving XAB, YAB and PAB, the Global Data Bus and the Program Address Bus.]
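The linear, modulo and reverse-carry updates performed by the three adders can be modelled in software. The sketch below is a simplified model and not the DSP56300 hardware algorithm: all function names are hypothetical, offsets are assumed to be non-negative in the reverse-carry case, and the modulo update assumes the circular buffer starts at an address whose low k bits are zero, where 2^k is the smallest power of two not below the buffer length.

```python
ADDR_BITS = 24                    # the text describes 24-bit full adders
ADDR_MASK = (1 << ADDR_BITS) - 1  # 2**24 = 16,777,216 addressable locations


def bit_reverse(value):
    """Mirror the ADDR_BITS-wide bit pattern of a non-negative value."""
    result = 0
    for _ in range(ADDR_BITS):
        result = (result << 1) | (value & 1)
        value >>= 1
    return result


def linear_update(r, n):
    """Offset adder: ordinary addition, wrapping at 24 bits."""
    return (r + n) & ADDR_MASK


def modulo_update(r, n, modulus):
    """Circular-buffer update (simplified; see the assumptions above)."""
    k = (modulus - 1).bit_length()
    base = r & ~((1 << k) - 1) & ADDR_MASK  # assumed aligned buffer start
    return base + (r - base + n) % modulus


def reverse_carry_update(r, n):
    """Reverse-carry adder: addition with the carry moving from MSB to LSB."""
    return bit_reverse((bit_reverse(r) + bit_reverse(n)) & ADDR_MASK)


# Visiting an 8-entry buffer with N = 4 (half the buffer length).
addr, order = 0, []
for _ in range(8):
    order.append(addr)
    addr = reverse_carry_update(addr, 4)
print(order)
```

Stepping by half the buffer length with reverse-carry arithmetic visits the entries in bit-reversed order, which is the access pattern needed by radix-2 FFT routines.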
Each Address ALU can update one address register from its respective address register
file during one instruction cycle. The contents of the associated modifier register specify
the type of arithmetic to be used in the address register update calculation. The modifier
value is decoded in the Address ALU.
The two Address ALUs can generate up to two addresses every instruction cycle:
● One for the PAB, or
● One for the XAB, or
● One for the YAB, or
● One for the XAB and one for the YAB
The AGU can directly address 16,777,216 locations on each of the XAB, YAB, and PAB.
Using a register triplet to address each operand, the two independent ALUs can work with
the two data memories to feed two operands to the Data ALU in a single cycle.
The registers are:

● Address Registers R0 – R3 on the Low Address ALU and R4 – R7 on the High Address ALU
● Offset Registers N0 – N3 on the Low Address ALU and N4 – N7 on the High Address ALU
● Modifier Registers M0 – M3 on the Low Address ALU and M4 – M7 on the High Address ALU
These registers are referred to as Rn for any address register, Nn for any offset register,
and Mn for any modifier register. The Rn, Nn, and Mn registers are register triplets; that
is, the offset and modifier registers of one triplet can be used only with an address register
that belongs to the same triplet. For example, N2 and M2 can be used only with R2.
The eight triplets are as follows:
Low Address ALU register triplets
— R0:N0:M0
— R1:N1:M1
— R2:N2:M2
— R3:N3:M3
High Address ALU register triplets
— R4:N4:M4
— R5:N5:M5
— R6:N6:M6
— R7:N7:M7
The Global Data Bus (GDB) can read from or write to each register. The address output
multiplexers select the address for the XAB, YAB, and PAB, where the address originates
from the R0 – R3 or R4 – R7 registers.
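The triplet rule described above (Nn and Mn can be used only with the Rn of the same index) can be expressed as a small data structure. This is an illustrative sketch: the class and method names are hypothetical, only linear arithmetic is modelled, and the modifier register is stored without being interpreted.

```python
from dataclasses import dataclass

ADDR_MASK = (1 << 24) - 1  # 24-bit address registers


@dataclass
class RegisterTriplet:
    index: int
    r: int = 0  # address register Rn
    n: int = 0  # offset register Nn
    m: int = 0  # modifier register Mn (not interpreted in this sketch)

    @property
    def alu(self):
        """Triplets 0-3 belong to the Low Address ALU, 4-7 to the High one."""
        return "low" if self.index < 4 else "high"


class AddressRegisterFile:
    def __init__(self):
        self.triplets = [RegisterTriplet(i) for i in range(8)]

    def post_update(self, r_index, n_index):
        """Add Nn to Rn, rejecting registers from different triplets."""
        if r_index != n_index:
            raise ValueError(f"N{n_index} cannot be used with R{r_index}")
        triplet = self.triplets[r_index]
        triplet.r = (triplet.r + triplet.n) & ADDR_MASK
        return triplet.r


regs = AddressRegisterFile()
regs.triplets[2].r, regs.triplets[2].n = 0x10, 4
print(hex(regs.post_update(2, 2)))
```

Calling `regs.post_update(2, 3)` raises a `ValueError`, mirroring the restriction that N3 cannot modify R2.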
AMD SenseMI Technology
Able to adapt and learn: it customizes itself.

1. Pure Power
● Monitors the CPU (temperature, resource usage, power draw)
● Optimizes power draw based on workload
● Minimizes power consumption to reduce system heat and noise
2. Precision Boost
● Adjusts the clock to optimize performance without pausing instructions
● High precision (25 MHz increments)
3. Extended Frequency Range (XFR)
● CPU speeds are permitted beyond Precision Boost limits
● Clock speed scales with the cooling solution
○ The clock can automatically overclock itself whenever the temperature allows
● Fully automated
4. Neural Net Prediction
● Builds a temporary map of how a program uses the CPU
● Prepares the fastest processor path based on the map
5. Smart Prefetch
● Learns how applications access data and anticipates what they will need next
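The 25 MHz granularity of Precision Boost listed above can be illustrated with a short calculation. This sketch only shows what a 25 MHz step size means for the set of reachable frequencies; it is not AMD's boost algorithm, and the function names and the example frequencies are assumptions.

```python
STEP_MHZ = 25  # from the text: 25 MHz increments


def snap_to_step(requested_mhz, step_mhz=STEP_MHZ):
    """Round a requested frequency down to the nearest available step."""
    return (int(requested_mhz) // step_mhz) * step_mhz


def available_steps(base_mhz, limit_mhz, step_mhz=STEP_MHZ):
    """All frequencies reachable between base and limit, inclusive."""
    return list(range(base_mhz, limit_mhz + 1, step_mhz))


fine = available_steps(3600, 3700)         # 25 MHz granularity
coarse = available_steps(3600, 3700, 100)  # 100 MHz granularity, for comparison

print(f"25 MHz steps:  {fine}")
print(f"100 MHz steps: {coarse}")
print(f"a 3690 MHz request lands on {snap_to_step(3690)} MHz")
```

With finer steps, the clock can settle closer to whatever frequency the power and thermal limits allow.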
Infinity Fabric
Before going into any details regarding Infinity Fabric, a quick refresher on
HyperTransport is useful, since that is where the story begins. HyperTransport
plays a role comparable to Intel’s Front Side Bus (FSB): it is what the various
ICs use to talk to each other. Think of it as a point-to-point interconnect
between ICs. Unlike Intel’s FSB, however, it is an open standard and is much
more flexible than its counterpart, allowing it to be used across various
processors (Intel’s FSB is tailor-made for every processor variant).
HyperTransport is also used as an interconnect for NUMA multiprocessor
implementations deploying a proprietary cache coherency solution (the
Coherent Fabric is a derivative of this concept). Needless to say, it is the
bread and butter of AMD’s chip-building philosophy. Infinity Fabric enlarges
the scope of HyperTransport, building it up even further into something
that will be used across all of AMD’s Ryzen CPUs and Vega GPUs.
In a way, Infinity Fabric, along with subsets like Coherent Fabric, can be
thought of as a superset of a new and improved HyperTransport 2.0,
considering that it utilizes the HyperTransport messaging protocol. Not much
is known about it right now, but the following is known:
● It will be completely modular.
● The bandwidth will scale from 30-50 GBps for notebooks to around 512
GBps for the Vega GPU.
● It will be used both as a network-on-chip solution and as a clustering
link between GPUs and x86 server SoCs.
● The CCIX standard is also supported, which will allow it to be coupled
with accelerators and FPGAs.
C, P and T States
A T-state was once known as a throttling state. Back in the days before C-
and P-states, T-states existed to save processors from burning
themselves up when things went very badly, such as when the cooling
fan failed while the processor was running as fast as it could. If a
well-placed temperature sensor registered that the junction
temperature was reaching a level that could damage the package or its
contents, the hardware (HW) power manager would place the
processor in different T-states depending upon temperature: the higher
the temperature, the higher the T-state.
The normal run state of the processor was T0. When the processor
entered a higher T-state, the manager would clock gate the cores to
slow down execution and allow the processor to cool. For example, in T1
the HW power manager might clock gate 12% of the cycles. In rough
terms, this means that the core will run for 88% of the time and sleep for
the rest. T2 might clock gate 25% of the cycles, and so on. In the very
highest T-state, over 90% of the cycles might be clock gated. (See the
figure below.)
Note that, in contrast to P-states, the voltage and frequency are not
changed. Also, with T-states the application runs slower not because
the processor is running slower, but because it is suspended for some
percentage of the time. In some ways, you can think of a T-state as being
like a clock-gated C1 state in which the processor is not idle, i.e. it is still
doing something useful.
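The effect of clock gating on runtime can be worked through numerically. The sketch below assumes that a workload's runtime scales inversely with the fraction of cycles that are not gated; the gating percentages are the illustrative values from the text, and the function names are hypothetical.

```python
def run_fraction(gated_fraction):
    """Fraction of cycles in which the core actually executes."""
    if not 0.0 <= gated_fraction < 1.0:
        raise ValueError("gated_fraction must be in [0, 1)")
    return 1.0 - gated_fraction


def throttled_runtime(base_runtime_s, gated_fraction):
    """Runtime of a compute-bound task when a share of the cycles is gated."""
    return base_runtime_s / run_fraction(gated_fraction)


# Illustrative gating levels from the text: T0 = none, T1 = 12%, T2 = 25%.
for name, gated in [("T0", 0.0), ("T1", 0.12), ("T2", 0.25)]:
    runtime = throttled_runtime(60.0, gated)
    print(f"{name}: runs {run_fraction(gated):.0%} of cycles, "
          f"a 60 s task takes {runtime:.1f} s")
```

Voltage and frequency are unchanged in this model; only the share of active cycles varies, which is the distinction from P-states drawn above.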
In the figure above, the top most area shows the runtime of a compute
intensive workload if no thermal overload occurs. The bottom shows the
situation with T states (i.e. before P states), where the processor begins
to toggle between running and stopped states to cool down the
processor. The middle is what happens in current processors, where the
frequency/voltage pair is reduced allowing the processor to cool.
There are a few more practical reasons to be at least aware of T-states:
● Some technical literature now uses the term "throttling states" to mean
P-states, not T-states.
● Some power management data structures, such as some defined by
ACPI, still include an unused T-state field. Many inquiries about T-states
originate from this fact.
● T-states may still be relevant in some embedded processors.
Ryzen Master Software
● CPU core clock/voltage adjustment
● Memory adjustments
● Personalized Performance
○ Up to four profiles to store custom clock and voltage adjustments
● System Monitoring
○ Real-time monitoring and histogram of per-core clock rates and temperature
CPU Performance
Cinebench R15 Benchmark
Running:
● Nvidia GTX 1070 FE
● Corsair H100i v2
● 16GB Crucial Ballistix DDR4 (at XMP 3,200MHz on Intel, max. 2,667MHz on AMD)
● Corsair HX1200i
● Philips BDM3275
Concluding Remarks
The Ryzen family of processors is a new beginning for AMD, which can now rival
Intel’s processor market share both among mainstream users and among
companies that use servers. The processors are still held back by production
and also by the AM4 motherboards.
This is good news for consumers, who now have a competitive market. Intel and
AMD can push each other to create better and better processors as each tries
to consolidate its consumer base.
