Documente Academic
Documente Profesional
Documente Cultură
bob.valentine@intel.com
Agenda
Introduction - Overview of ISA Extensions Haswell New Instructions
New Instructions Overview Intel AVX2 (256-bit Integer Vectors) Gather FMA: Fused Multiply-Add Bit Manipulation Instructions TSX/HLE/RTM
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
1999 2001 2004 2006 2007 2008 2010 2011 2012 2013
http://en.wikipedia.org/wiki/X86_instruction_listings
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
2010
Westmere
2011
Sandy Bridge
New
2012
Ivy Bridge
2013
Haswell
New
2014
Broadwell
2015
Skylake
New
Micro-
New process
22nm
Microarchitecture
22nm
New process
Microarchitecture
TICK
TOCK TOCK
TICK TICK
TOCK
TICK
TOCK
Tick-tock delivers leadership through technology innovation on a reliable and predictable timeline
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
AVX2
SIMD Integer Instructions promoted to 256 bits Gather Shuffling / Data Rearrangement
Load elements from vector of indices vectorization enabler Blend, element shift and permute instructions
170 / 124
Fused Multiply-Add operation forms ( FMA-3) Improving performance of bit stream manipulation and decode, large integer arithmetic and hashes Transactional Memory MOVBE: Load and Store of Big Endian forms INVPCID: Invalidate processor context ID
96 / 60 15 / 15
4/4 2/2
Intel AVX2 completes the 256-bit extensions started with Intel AVX
7
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Gather
Gather is a fundamental building block for vectorizing indirect memory accesses.
int a[] = {1,2,3,4,,99,100}; int b[] = {0,4,8,12,16,20,24,28}; // indices for (i=0; i<n; i++) { x[i] = a[b[i]]; }
Haswell introduces set of GATHER instructions to allow automatic vectorization of similar loops with non-adjacent, indirect memory accesses
HSW GATHER not always faster software can make optimizing
assumptions
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
integer data (no FP data) indicate index size; here double word (32bit) to indicate data size; here quad word (64 bit) integer destination register index register; here 4-byte size indices scale factor general register containing base address offset added to base address mask register; only most significant bit of each element used
Gather: Fundamental building block for nonadjacent, indirect memory accesses for either integer or floating point data
10
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
VPGATHERDQ
Intrinsics
base
index
scale
0
Index : ymm9
n/a
n/a
n/a
n/a
10
14
4
Index Size
255 7777777766666666
63
memory
5454545454545454 d3 dcdcdcdcdcdcdcdc d2
dest : ymm1
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Destination register either XMM or YMM Mask register matches destination No difference in instruction name
Index Size
Data Size
4 4 8
VGATHERDPS VPGATHERDD
8
VGATHERDPD VPGATHERDQ
VGATHERQPS VPGATHERQD
VGATHERQPD VPGATHERQQ
Gather is fully deterministic Same behavior from run to run even in case of exceptions Mask bits are set to 0 after processing
Gather is complete when mask is all 0 The instruction can be restarted in the middle
12
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
13
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
New Strategy
Control register
any
any
any
any
any
any
any
any
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
VPERMD
Instruction
VPERMD
Intrinsics
256
128
5 23
4 22
6 21
3 20
3 19
3 18
0 17
0 16
VPERMD ymm1,ymm2,ymm3
b : ymm1
21
20
22
19
19
19
16
16
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
d7
d6
d5
d4
d3
d2
d1
d0
dst
d7 >> cnt7
d6 >> cnt6
d5 >> cnt5
d4 >> cnt4
d3 >> cnt3
d2 >> cnt2
d1 >> cnt1
d0 >> cnt0
* If the controls are greater than data width, then the destination data element are written with 0.
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Can perform 8 single-precision FMA operations or 4 double-precision FMA operations with 256-bit vectors per FMA unit
Increases FLOPS capacity over Intel AVX Maximum throughput of two FMA operations per cycle
ab c
(ab + c)
fused negative mul- ps, pd, ss, add sd fused negative mul- ps, pd, ss, sub sd ps, pd ps, pd
(ab c)
a[i]b[i] + c[i] on odd indices a[i]b[i] c[i] on even indices a[i]b[i] c[i] on odd indices a[i]b[i] + c[i] on even indices
Multiple FMA varieties support both data types and eliminate additional negation instructions
18
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
19
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Zero High Bits Starting with Specified Position Variable Shifts (non-destructive, have Load+Operation forms, no implicit CL dependency and no flags effect) Z = X << Y, Z = X >> Y (for signed and unsigned X) Parallel Bit Deposit Parallel Bit Extract Rotate Without Affecting Flags Unsigned Multiply Without Affecting Flags
Note: Software needs to check for CPUID bits of all groups it uses instructions from
20
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Thread 2
X Y
Thread 1 and 2
Lock
Thread 1
X Y
No data conflicts
Thread 2
Data conflicts truly limit concurrency, not lock contention Focus on data conflicts, not lock contention
23
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
HLE Execution
Thread 1
Lock Acq. Critical section Lock Rel.
Thread 2
Lock Acq. Critical section Lock Rel.
X Y
24
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
25
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
XBEGIN: Starts a transaction in the pipeline. Causes a checkpoint of register state and all following memory transactions to be buffered. Rel16/32 is the fallback handler. Nested depth support up to 7 XBEGINS. Abort if depth is 8. All abort roll-back is to outermost region. XEND: Causes all buffered state to be atomically committed to memory. LOCK ordering semantics (even for empty transactions) XABORT: Causes all buffered state to be discarded and register checkpoint to be recovered. Will jump to the XBEGIN labeled fallback handler. Takes imm8 argument. There is also an XTEST instruction which can be used both for HLE and RTM to query whether execution takes place in a transaction region
26
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
RTM
Execute alternative code path if RTM aborts Visible to software Provides return code with reason tra
-Qxcore-avx2, -Qaxcore-avx2 (Windows*) -xcore-avx2 and axcore-avx2 (Linux) -march=core-avx2 Intel compiler Separate options for FMA:
-Qfma, -Qfma- (Windows) -fma, -no-fma (Linux)
Asm, intrinsics, automatic vectorization Asm, intrinsics, automatic vectorization All supported through asm/intrinsics Some through automatic code generation
Asm only
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
28
29
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
References
New Intel AVX (AVX2) instructions specification http://software.intel.com/file/36945 Forum on Haswell New Instructions http://software.intel.com/en-us/blogs/2011/06/13/haswellnew-instruction-descriptions-now-available/ Article from Oracle engineers on how to use hardwaresupported transactional memory on user level code ( not Intel TSX specific ) http://labs.oracle.com/scalable/pubs/HTM-algs-SPAA2010.pdf
30
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
31
Copyright 2012, Intel Corporation. All rights reserved. Partially Intel Confidential *Other brands and names are the property of their respective owners. 2012, Intel Corporation. All rights reserved. Copyright *Other brands and names are the property of their respective owners.
Information.
VGATHERDD
Instruction
VPGATHERDD
Intrinsics
base
index
0
Index : ymm0
23
21
19
17
0
Index Size
255 77777777
0 00000000 d0 88888888
memory
ffffffff
dest : ymm1
d6 87878787d7 66666666
d5 65656565
d4 44444444
d3 22222222 d2 43434343
d1 21212121
d0 00000000
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
VGATHERQQ
Instruction
VPGATHERQQ
Intrinsics
base
index
0
Index : ymm0
255
15
10
0
Index Size
127 3333333322222222
63
0 1111111100000000 d0
memory
5454545454545454 d2 dcdcdcdcdcdcdcdc
dest : ymm1
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
VGATHERQD
Instruction
VPGATHERQD
Intrinsics
base
index
0
Index : ymm0
15
2
Index Size
255 77777777
memory
dest : ymm1
d3 44444444 d2 65656565
d1 43434343
d0 22222222
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
4 VGATHERDD Gather Packed Dword using signed Dword Indices Int Index Size
__m128i __m128i __m256i __m256i _mm_i32gather_epi32() _mm_mask_i32gather_epi32() _mm256_i32gather_epi32() _mm256_mask_i32gather_epi32()
8 Int
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
Example:
Notes
Combination of 20 varieties with 3 operand orderings result in 60 new mnemonicsAT&T* format, read operands right-to-left:
Src3, 2, 1
FMA operand ordering allows complete flexibility in selection of memory operand and destination
37
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.
FMA Latency
Due to out-of-order, FMA latency is not always better than separate multiply and add instructions Add latency: 3 cycles Multiply and FMA latencies: 5 cycles Will likely differ in later CPUs a
(ab) + c)
a c
b
FMA
c
+ Latency = 5
+
a b +
Latency = 8
c
+
FMA
(ab) + (cd)
Latency = 8
Latency = 10
Copyright 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.