
Parallel Prefix Sum (Scan) with CUDA


Dimitrios Leventeas

20 June 2011

Introduction

Definition (All-prefix-sums)
The all-prefix-sums operation (scan) takes a binary associative
operator $\oplus$ and an array of n elements
$[a_0, a_1, \ldots, a_{n-1}]$,
and returns
$[a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-1})]$.
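
The definition holds for any associative operator, not only addition. As a small illustration, the sketch below uses `itertools.accumulate` from the Python standard library, which computes this kind of inclusive scan for a given binary function. The sample array is made up for the example.

```python
from itertools import accumulate
import operator

a = [3, 1, 4, 1, 5]

# Inclusive scan of the same array with three different associative operators.
sums = list(accumulate(a, operator.add))
products = list(accumulate(a, operator.mul))
running_max = list(accumulate(a, max))

print(sums)
print(products)
print(running_max)
```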

Sequential Algorithm

import random

# Ten random digits as input.
inp = [random.randint(0, 9) for _ in range(10)]
out = [inp[0]]
for i in range(1, len(inp)):
    # Each output is the previous running total plus the current input.
    out.append(out[-1] + inp[i])
print(inp)
print(out)

Example

input:   6   1   5  10   1   7   2   5   7
output:  6   7  12  22  23  30  32  37  44

Inclusive/Exclusive Scan

Exclusive Scan

Definition (Exclusive Scan)


The exclusive scan operation takes a binary associative operator
$\oplus$ with identity $I$, and an array of n elements
$[a_0, a_1, \ldots, a_{n-1}]$
and returns the array:
$[I, a_0, (a_0 \oplus a_1), \ldots, (a_0 \oplus a_1 \oplus \cdots \oplus a_{n-2})]$
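
A minimal sequential sketch of this definition, assuming addition with identity 0 as the default operator. The function name `exclusive_scan` and its parameters are hypothetical and chosen only for this illustration.

```python
def exclusive_scan(a, op=lambda x, y: x + y, identity=0):
    """Sequential exclusive scan: out[i] combines a[0..i-1]; out[0] is the identity."""
    out = []
    running = identity
    for value in a:
        out.append(running)           # everything strictly before this element
        running = op(running, value)  # now include the element for the next position
    return out

print(exclusive_scan([6, 1, 5, 10, 1, 7, 2, 5, 7]))
print(exclusive_scan([2, 3, 4], op=lambda x, y: x * y, identity=1))
```

Passing a different operator requires passing its identity as well, for instance 1 for multiplication.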

Example

input:   6   1   5  10   1   7   2   5   7
output:  0   6   7  12  22  23  30  32  37

Notes

Exclusive scan = inclusive scan shifted right by one position (>> 1), with the first element set to $I$.

Inclusive scan = exclusive scan shifted left by one position (<< 1), with the last element set to $\sum_i a_i$.
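
The two shifts can be sketched as follows, assuming addition with identity 0. The helper names are hypothetical. The total of all inputs is obtained here as the last exclusive value plus the last input.

```python
def inclusive_to_exclusive(inc, identity=0):
    # Shift right by one: drop the last element and put the identity in front.
    if not inc:
        return []
    return [identity] + inc[:-1]

def exclusive_to_inclusive(exc, a):
    # Shift left by one: drop the leading identity and append the total of all inputs.
    if not a:
        return []
    return exc[1:] + [exc[-1] + a[-1]]

a = [6, 1, 5, 10, 1, 7, 2, 5, 7]
inc = [6, 7, 12, 22, 23, 30, 32, 37, 44]
exc = inclusive_to_exclusive(inc)
print(exc)
print(exclusive_to_inclusive(exc, a) == inc)
```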

Uses
1. Lexically compare strings
2. Add multi-precision numbers
3. Evaluate polynomials
4. Solve recurrences
5. Implement radix sort
6. Implement quick sort
7. Solve tridiagonal linear systems
8. Delete marked elements from an array
9. Dynamically allocate processors
10. Perform lexical analysis
11. Search for regular expressions
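
As one illustration of these uses, deleting marked elements from an array (often called stream compaction) can be sketched with an exclusive scan over keep-flags. The scan gives every kept element its index in the output. The function name `compact` and the flag convention (1 = keep, 0 = delete) are assumptions made for this example.

```python
def compact(values, keep):
    """Remove the elements whose flag in `keep` is 0, using an exclusive scan."""
    # Exclusive scan of the flags: for every kept element this is its output index.
    positions = []
    running = 0
    for flag in keep:
        positions.append(running)
        running += flag
    out = [None] * running  # running now equals the number of kept elements
    # Scatter step: the iterations are independent, so they could run in parallel.
    for i, flag in enumerate(keep):
        if flag:
            out[positions[i]] = values[i]
    return out

print(compact(list("abcdef"), [1, 0, 1, 1, 0, 1]))
```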

Parallel Scan
Basic concepts

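Definition (Work-efficient)
A parallel algorithm is work-efficient if it performs asymptotically no
more operations (work) than the sequential version. The two
implementations must have the same work complexity.
Definition (Step complexity)
The number of steps that the algorithm executes, where all operations
performed in parallel count as a single step.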

Hillis and Steele Algorithm

Example

Hillis and Steele scan of 8 elements. Each row shows the array after one step, and Σ(xi..xj) denotes xi + ... + xj. In step d, every element k >= 2^(d-1) adds the element 2^(d-1) positions to its left.

input:   x0         x1         x2         x3         x4         x5         x6         x7
step 1:  Σ(x0..x0)  Σ(x0..x1)  Σ(x1..x2)  Σ(x2..x3)  Σ(x3..x4)  Σ(x4..x5)  Σ(x5..x6)  Σ(x6..x7)
step 2:  Σ(x0..x0)  Σ(x0..x1)  Σ(x0..x2)  Σ(x0..x3)  Σ(x1..x4)  Σ(x2..x5)  Σ(x3..x6)  Σ(x4..x7)
step 3:  Σ(x0..x0)  Σ(x0..x1)  Σ(x0..x2)  Σ(x0..x3)  Σ(x0..x4)  Σ(x0..x5)  Σ(x0..x6)  Σ(x0..x7)

1st attempt: Hillis and Steele Algorithm

Algorithm 1 Hillis and Steele Algorithm
Require: Array x of length n
Ensure: In place scan of array x
for d = 1 to log2 n do
    for all k < n in parallel do
        if k >= 2^(d-1) then
            x[k] = x[k - 2^(d-1)] + x[k]
        end if
    end for
end for
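
The algorithm can be simulated on a single thread as sketched below. This is an illustration under stated assumptions, not a CUDA kernel: the snapshot `old` stands in for the lockstep behaviour of the parallel loop, in which every element reads its operands before any element writes. The function name is hypothetical, and the update condition used is k >= 2^(d-1), which matches the worked example where x1 already becomes x0 + x1 in the first step.

```python
def hillis_steele_scan(x):
    """Inclusive +-scan that simulates the lockstep parallel update sequentially."""
    x = list(x)
    n = len(x)
    offset = 1  # offset = 2^(d-1) for d = 1, 2, ...
    while offset < n:
        old = list(x)  # every "thread" reads the values from before this step
        for k in range(offset, n):  # the threads with k >= 2^(d-1)
            x[k] = old[k - offset] + old[k]
        offset *= 2
    return x

print(hillis_steele_scan([6, 1, 5, 10, 1, 7, 2, 5, 7]))
```

The loop on `offset` runs once per step d, so the number of outer iterations grows with log2 n, while each inner loop represents work done in parallel.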

Work complexity

Theorem
The algorithm performs
\[
\sum_{d=1}^{\log_2 n} \left(n - 2^{d-1}\right)
= \sum_{d=1}^{\log_2 n} n - \sum_{d=1}^{\log_2 n} 2^{d-1}
= n \log_2 n - (n - 1)
= O(n \log n)
\]
additions.

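
The count can be checked numerically. The sketch below assumes n is a power of two and that in step d exactly the n - 2^(d-1) elements with k >= 2^(d-1) perform one addition each. It compares the total with the closed form n log2 n - (n - 1) and with the n - 1 additions of the sequential scan. The function name is hypothetical.

```python
def count_additions(n):
    """Count the additions of the Hillis and Steele scan for n a power of two."""
    log_n = n.bit_length() - 1  # exact log2 for powers of two
    total = 0
    for d in range(1, log_n + 1):
        total += n - 2 ** (d - 1)  # one addition per element with k >= 2^(d-1)
    return total

for n in (2, 8, 1024):
    log_n = n.bit_length() - 1
    closed_form = n * log_n - (n - 1)
    sequential = n - 1
    print(n, count_additions(n), closed_form, sequential)
```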

Notes

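+ Step complexity is O(log n).
- Work complexity is O(n log n), so the algorithm is not work-efficient: the sequential scan needs only O(n) additions.
- We need O(n) processors so that all elements are updated in lockstep. Otherwise, we must buffer the
intermediate results.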
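
To see why buffering is needed, the sketch below runs the in-place update with a single worker that visits k in ascending order. This is an assumed worst case chosen for illustration: some reads of x[k - 2^(d-1)] then return a value that was already overwritten in the same step, and the result is no longer the prefix sum. The function name is hypothetical.

```python
from itertools import accumulate

def scan_in_place_one_at_a_time(x):
    """In-place update executed by one worker, k ascending, with no buffering."""
    x = list(x)
    n = len(x)
    offset = 1  # offset = 2^(d-1)
    while offset < n:
        for k in range(offset, n):
            # x[k - offset] may already hold this step's new value (read-after-write hazard).
            x[k] = x[k - offset] + x[k]
        offset *= 2
    return x

data = [1] * 8
print(scan_in_place_one_at_a_time(data))  # differs from the correct prefix sums
print(list(accumulate(data)))             # correct inclusive scan for comparison
```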

2nd attempt: Hillis and Steele Algorithm with buffering


Algorithm 2 Hillis and Steele Algorithm with buffering
Require: Buffers x[in] and x[out] of length n, with the input array in x[in]
Ensure: Scan of the input array in x[in]
for d = 1 to log2 n do
    for all k < n in parallel do
        if k >= 2^(d-1) then
            x[out][k] = x[in][k - 2^(d-1)] + x[in][k]
        else
            x[out][k] = x[in][k]
        end if
    end for
    swap(in, out)
end for
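
A sequential sketch of the buffered version, assuming addition as the operator. The names `src` and `dst` stand in for `in` and `out` (`in` is a Python keyword), and the function name is hypothetical. Because every read goes to one buffer and every write to the other, the order in which the k are processed no longer matters.

```python
def hillis_steele_double_buffered(a):
    """Inclusive +-scan with two buffers that swap roles after every step."""
    n = len(a)
    x = [list(a), [0] * n]  # x[src] plays the role of x[in], x[dst] of x[out]
    src, dst = 0, 1
    offset = 1  # offset = 2^(d-1)
    while offset < n:
        for k in range(n):  # each k would be handled by its own thread
            if k >= offset:
                x[dst][k] = x[src][k - offset] + x[src][k]
            else:
                x[dst][k] = x[src][k]  # already final, copy it over unchanged
        src, dst = dst, src  # swap(in, out)
        offset *= 2
    return x[src]  # after the last swap the result sits in the "in" buffer

print(hillis_steele_double_buffered([6, 1, 5, 10, 1, 7, 2, 5, 7]))
```

The price for the extra safety is twice the storage and one copy per step for the elements that are already final.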
