Parallel Solver for Bordered Block Diagonal Matrix

Shashank Gangrade

Under the guidance of Prof. Sachin B. Patkar,

Department of Electrical Engineering


Indian Institute of Technology, Bombay

Oct 26, 2016

Outline
1 Introduction
2 Problem Formulation
3 Software Implementation
External Packages
Software API
Results
4 Hardware Implementation
Memory Bottleneck
Hardware Blocks
Matrix Inversion Unit
Matrix Multiplication
Functional Block
Results
5 Conclusion & Future Work
Motivation

A circuit simulator for linear networks with a large number of nodes
Use the Bordered Block Diagonal (BBD) matrix generated by node-tearing nodal analysis
Design a system to solve BBD matrices as large as 1M × 1M (i.e., one million nodes)
Use parallel methods in hardware and software to design a performance-critical and resource-efficient system

Problem Formulation

A general form of the Bordered Block Diagonal matrix system can be written as AX = G:

\[
\begin{bmatrix}
A_1 & 0 & 0 & \cdots & 0 & B_1 \\
0 & A_2 & 0 & \cdots & 0 & B_2 \\
0 & 0 & A_3 & \cdots & 0 & B_3 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & 0 & \cdots & A_{N-1} & B_{N-1} \\
C_1 & C_2 & C_3 & \cdots & C_{N-1} & A_N
\end{bmatrix}
\begin{bmatrix}
X_1 \\ X_2 \\ X_3 \\ \vdots \\ X_{N-1} \\ X_N
\end{bmatrix}
=
\begin{bmatrix}
G_1 \\ G_2 \\ G_3 \\ \vdots \\ G_{N-1} \\ G_N
\end{bmatrix}
\]

where, for i in [1, N-1]:
\(A_i\) is an m × m matrix, \(B_i\) is an m × n matrix, \(C_i\) is an n × m matrix, and \(A_N\) is an n × n matrix;
\(G_i\) and \(X_i\) are m × 1 vectors, and \(G_N\) and \(X_N\) are n × 1 vectors.

For each row i in the range [1, N-1], the row equation can be written as

\[
X_i = A_i^{-1}(G_i - B_i X_N) \tag{1}
\]

Similarly, solving for \(X_N\) from the N-th row:

\[
C_1 X_1 + C_2 X_2 + \cdots + C_{N-1} X_{N-1} + A_N X_N = G_N
\]
\[
A_N X_N = G_N - (C_1 X_1 + C_2 X_2 + \cdots + C_{N-1} X_{N-1})
\]
\[
A_N X_N = G_N - \sum_{i=1}^{N-1} C_i X_i \tag{2}
\]

Substituting the value of \(X_i\) from (1) into (2) gives an equation in terms of \(X_N\) only.
\[
A_N X_N = G_N - \sum_{i=1}^{N-1} C_i A_i^{-1}(G_i - B_i X_N)
\]

Taking the \(X_N\) terms to one side:

\[
\left( A_N - \sum_{i=1}^{N-1} C_i A_i^{-1} B_i \right) X_N = G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} G_i
\]
\[
X_N = \left( A_N - \sum_{i=1}^{N-1} C_i A_i^{-1} B_i \right)^{-1} \left( G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} G_i \right) \tag{3}
\]

Let \(G_i^*\) and \(B_i^*\) denote the individual summation terms in \(X_N\):

\[
G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i
\]
Solution of AX = G

\[
X_i = A_i^{-1}(G_i - B_i X_N), \quad \forall i \in [1, N-1]
\]
\[
X_N = \left( A_N - \sum_{i=1}^{N-1} B_i^* \right)^{-1} \left( G_N - \sum_{i=1}^{N-1} G_i^* \right)
\]
\[
G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i
\]

We can identify two parallel blocks in this:
Each \(G_i^*\) and \(B_i^*\) can be calculated in parallel, giving a speedup in the calculation of \(X_N\)
Each \(X_i\) can be calculated in parallel from \(X_N\), giving a speedup in the calculation of \(X_i\)

Software Implementation

Motivation for Software:

The basic idea is to have a complete C-based implementation of the Bordered Block Diagonal matrix solver
The software system can serve as a benchmark for the subsequent hardware design
Profile the run times of the various parts of the program
Exploit the inherent parallelism using the multiple cores present on the CPU

Software API

The diagonal matrix \(A_i\) is m × m
The border matrices \(B_i\) and \(C_i\) are m × n and n × m respectively
The number of diagonal blocks of the matrix is N
The tile size is the size of each tile matrix, TileSize = m = n
Linear algebra packages:
LAPACK
LAPACKE
BLAS
Parallel programming:
OpenMP

Algorithm Flow Diagram

Description of the parts (a C sketch of the full flow follows this list):

Part 1: Calculate \(\sum G_i^*\) and \(\sum B_i^*\), where \(G_i^* = C_i A_i^{-1} G_i\) and \(B_i^* = C_i A_i^{-1} B_i\)
Part 2: Calculate \(X_N\) from \(\sum G_i^*\) and \(\sum B_i^*\)
Part 3: Calculate each individual \(X_i = A_i^{-1}(G_i - B_i X_N)\)
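The following is a minimal C sketch of this three-part flow, not the project's actual implementation: it assumes square tiles with m = n = 4 stored as dense row-major arrays, uses LAPACKE (dgetrf followed by dgetri) for the tile inverses, and parallelises the independent per-block work of Parts 1 and 3 with OpenMP, roughly mirroring the packages listed on the previous slide. The helper names (inv4, mm, mv, bbd_solve) are illustrative only.

/*
 * Minimal sketch of the BBD solve flow (Parts 1-3), not the project's
 * actual implementation.  Assumes m = n = 4 and dense row-major tiles.
 */
#include <lapacke.h>
#include <omp.h>

#define T 4                              /* tile size (m = n = 4) */

static void inv4(double A[T * T])        /* A <- A^-1 via LAPACKE */
{
    lapack_int piv[T];
    LAPACKE_dgetrf(LAPACK_ROW_MAJOR, T, T, A, T, piv);
    LAPACKE_dgetri(LAPACK_ROW_MAJOR, T, A, T, piv);
}

static void mm(const double A[T * T], const double B[T * T], double C[T * T])
{
    for (int i = 0; i < T; i++)          /* C = A * B */
        for (int j = 0; j < T; j++) {
            double s = 0.0;
            for (int k = 0; k < T; k++)
                s += A[i * T + k] * B[k * T + j];
            C[i * T + j] = s;
        }
}

static void mv(const double A[T * T], const double x[T], double y[T])
{
    for (int i = 0; i < T; i++) {        /* y = A * x */
        y[i] = 0.0;
        for (int k = 0; k < T; k++)
            y[i] += A[i * T + k] * x[k];
    }
}

/* A, B, C, G hold the N-1 diagonal/border tiles and RHS tiles;
 * AN and GN are the corner block and corner RHS; X and XN are outputs. */
void bbd_solve(int N, double (*A)[T * T], double (*B)[T * T], double (*C)[T * T],
               const double AN[T * T], double (*G)[T], const double GN[T],
               double (*X)[T], double XN[T])
{
    double sumB[T * T] = {0}, sumG[T] = {0};

    /* Part 1: each Gi* = Ci Ai^-1 Gi and Bi* = Ci Ai^-1 Bi is independent */
    #pragma omp parallel for reduction(+:sumB[:T*T], sumG[:T])
    for (int i = 0; i < N - 1; i++) {
        double CAi[T * T], Bs[T * T], Gs[T];
        inv4(A[i]);                      /* A[i] now holds Ai^-1 */
        mm(C[i], A[i], CAi);             /* Ci * Ai^-1 */
        mm(CAi, B[i], Bs);               /* Bi* */
        mv(CAi, G[i], Gs);               /* Gi* */
        for (int k = 0; k < T * T; k++) sumB[k] += Bs[k];
        for (int k = 0; k < T; k++)     sumG[k] += Gs[k];
    }

    /* Part 2: XN = (AN - sum Bi*)^-1 (GN - sum Gi*) */
    double M[T * T], r[T];
    for (int k = 0; k < T * T; k++) M[k] = AN[k] - sumB[k];
    for (int k = 0; k < T; k++)     r[k] = GN[k] - sumG[k];
    inv4(M);
    mv(M, r, XN);

    /* Part 3: each Xi = Ai^-1 (Gi - Bi XN) is independent as well */
    #pragma omp parallel for
    for (int i = 0; i < N - 1; i++) {
        double t[T];
        mv(B[i], XN, t);
        for (int k = 0; k < T; k++) t[k] = G[i][k] - t[k];
        mv(A[i], t, X[i]);               /* A[i] already holds Ai^-1 */
    }
}

Splitting the work this way lets Parts 1 and 3 scale with the number of cores, while Part 2 remains a single small dense solve, which is consistent with the run-time breakdown in the tables that follow.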

Run time comparison for BBD solver and standard solver

N      Tile Size   Part1   Part2   Part3   BBD Total   LAPACK Total
100    4           0.255   0.017   0.049   0.329       6.904
500    4           1.036   0.03    0.238   1.313       266.813
1000   4           1.882   0.024   0.578   2.505       1939.958
1500   4           3.041   0.023   0.798   3.877       6874.44
100    8           0.956   0.033   0.131   1.134       23.711
500    8           1.899   0.033   0.377   2.32        212.629
1000   8           5.526   0.036   1.026   6.601       15743.108
1500   8           5.632   0.027   1.055   6.723       50147.257

Table: Run times in ms for different sized matrices (Part1-Part3 and BBD Total are the BBD solver; LAPACK Total is the standard solver)

Note: the standard solver is the dgesv() function from the LAPACK library


Figure: Comparison of BBD solver with LAPACK


Run times for large BBD solver

N       Run Time (ms), Tile Size = 4   Run Time (ms), Tile Size = 8
100     0.329                          1.134
500     1.313                          2.32
1000    2.505                          6.601
5k      12.742                         26.13
10k     24.483                         46.477
50k     119.576                        277.521
100k    246.337                        480.946
500k    1896.716                       2843.414
1M      2334.047                       5532.105

Table: Run times for the BBD matrix solver with a large number of blocks


Figure: Run times for large BBD solver

Hardware Implementation
Idea

The basic idea is to come up with a hardware design that can perform better than the software API running on a multi-core CPU.


Design in hardware

Basis of the Design Methodology
Follow a performance-driven design methodology
Understand the various bottlenecks in the system
Extract parallelism given the constraints of the bottlenecks
Perform better than software to justify the additional design effort
Specifications
Design targeted towards a high-end FPGA like the Virtex-6
Use a high-level functional hardware description language, Bluespec


Memory Bottleneck

Latency and bandwidth are the bottlenecks in hardware performance
For a typical DDR3 SODIMM memory on board a Virtex-6 ML605:
Bandwidth = 8.5 GB/s (1066 MT/s)
Latency = 10-20 clock cycles
For 4 × 4 tiles of single-precision floats, this bandwidth can deliver on average one tile of the matrix per clock cycle (a rough numerical check follows this list)
A system designed to work within these constraints can deliver maximum performance
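A rough numerical check of the tile-per-cycle claim, assuming a 4 × 4 tile of single-precision floats occupies 64 bytes and the FPGA clock is the 50 MHz used later in the results:

\[
\frac{8.5\ \text{GB/s}}{64\ \text{B/tile}} \approx 1.3 \times 10^{8}\ \text{tiles/s},
\]

which is more than one tile per FPGA clock cycle on average.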


Hardware Design

The system has the following specifications (a check of the tile fetch cost follows this list):

Each tile is a 4 × 4 matrix, with elements stored as 32-bit single-precision floating point numbers
The memory bandwidth gives 128 bits per cycle
The memory access pattern is predefined, hence latency is hidden by prefetching
It takes 4 clock cycles to bring a complete tile from memory to the FPGA
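As referenced above, the 4-cycle figure follows directly from the tile size and the per-cycle bandwidth:

\[
\frac{4 \times 4 \times 32\ \text{bits}}{128\ \text{bits/cycle}} = \frac{512}{128} = 4\ \text{cycles per tile}.
\]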


Tile Inversion Block

Matrix Inverse
Inversion is based on Gauss-Jordan elimination
A sequence of operations is applied to the input matrix and a predefined identity matrix
Successive operations convert the input matrix into the identity matrix, while the identity matrix is transformed into the inverse of the initial matrix (see the software sketch after this list)
Hardware Specifications
Inputs one tile of the matrix in one cycle
For an n × n matrix, the inverse can be calculated in n² cycles
In every cycle, either an FP division or an FP multiply-add is calculated
Each inversion unit has 8 FP MAC units and 8 FP division units
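Below is a software reference sketch of this Gauss-Jordan scheme, written for a 4 × 4 single-precision tile without pivoting so that it mirrors the fixed division / multiply-add schedule described above; it is an illustration only, not the Bluespec inversion unit.

/* Gauss-Jordan inverse of a 4x4 tile, no pivoting (assumes nonzero pivots).
 * A is transformed into the identity while Inv, which starts as the
 * identity, is transformed into the inverse of the original A. */
#define T 4

void gauss_jordan_inverse(float A[T][T], float Inv[T][T])
{
    for (int r = 0; r < T; r++)                  /* Inv <- identity */
        for (int c = 0; c < T; c++)
            Inv[r][c] = (r == c) ? 1.0f : 0.0f;

    for (int p = 0; p < T; p++) {
        float piv = A[p][p];
        for (int c = 0; c < T; c++) {            /* pivot row: FP divisions */
            A[p][c]   /= piv;
            Inv[p][c] /= piv;
        }
        for (int r = 0; r < T; r++) {            /* other rows: FP multiply-adds */
            if (r == p) continue;
            float f = A[r][p];
            for (int c = 0; c < T; c++) {
                A[r][c]   -= f * A[p][c];
                Inv[r][c] -= f * Inv[p][c];
            }
        }
    }
}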


Figure: Inversion Unit for 3x3



Matrix-matrix Multiplication Block

Matrix-matrix Multiplication
Based on the rank-one update matrix multiplication algorithm [1]
Each step adds a rank-one matrix to the accumulated product (see the software sketch after this list)
Hardware Specifications
Inputs one tile of the matrix in one cycle
For an n × n matrix, the multiplication can be calculated in n² cycles
In every cycle, an FP multiply-add is performed
Each multiplication unit has 4 FP MAC units
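Below is a software reference sketch of the rank-one update formulation for a 4 × 4 tile: C = A·B is accumulated as a sum over k of the outer product of column k of A with row k of B, one rank-one update per step. It illustrates the algorithm cited above and is not the hardware multiplication unit itself.

/* C = A * B built as T successive rank-one (outer product) updates. */
#define T 4

void matmul_rank1(const float A[T][T], const float B[T][T], float C[T][T])
{
    for (int i = 0; i < T; i++)
        for (int j = 0; j < T; j++)
            C[i][j] = 0.0f;

    for (int k = 0; k < T; k++)                  /* one rank-one update per k */
        for (int i = 0; i < T; i++)
            for (int j = 0; j < T; j++)
                C[i][j] += A[i][k] * B[k][j];    /* multiply-add per element */
}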


Figure: Matrix Multiplication Unit for 3x3


Functional Unit

The functional unit calculates \(B^*\) and \(G^*\) by accumulating \(B_i^*\) and \(G_i^*\) from each block:

\[
B^* = \sum_i B_i^*, \qquad G^* = \sum_i G_i^*
\]

Given the memory bandwidth and the latency of the blocks, the design has 4 functional units

Figure: Functional Unit


Example of Hardware Data Flow

Matrix A has a tile size of 4 × 4, and the number of diagonal tiles is N = 13
Every element of A is stored as a 32-bit floating point value
Memory can fetch 128 bits in one clock cycle

\[
\begin{bmatrix}
A_1 & 0 & \cdots & 0 & B_1 \\
0 & A_2 & \cdots & 0 & B_2 \\
\vdots & \vdots & \ddots & \vdots & \vdots \\
0 & 0 & \cdots & A_{12} & B_{12} \\
C_1 & C_2 & \cdots & C_{12} & A_{13}
\end{bmatrix}
\begin{bmatrix}
X_1 \\ X_2 \\ \vdots \\ X_{12} \\ X_{13}
\end{bmatrix}
=
\begin{bmatrix}
G_1 \\ G_2 \\ \vdots \\ G_{12} \\ G_{13}
\end{bmatrix}
\]


Example of Hardware Data Flow

Memory bandwidth allows 128 bits per clock cycle
\(A_i\), \(B_i\), \(C_i\) are 4 × 4 matrices and take 4 cycles each to fetch
\(G_i\) is a 4 × 1 vector and takes 1 cycle to fetch
As per the access pattern shown here, it takes 52 clock cycles to fetch the data for 4 functional units (the arithmetic is spelled out after this slide)
This access pattern repeats every 52 clock cycles, (N-1)/4 times

Matrix   Clock#      Matrix   Clock#
A1       4           B1       36
A2       8           B2       40
A3       12          B3       44
A4       16          B4       48
C1       20          G1       49
C2       24          G2       50
C3       28          G3       51
C4       32          G4       52

Table: Memory access pattern
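The 52-cycle figure follows from the per-tile fetch costs listed above, for one group of 4 functional units:

\[
\underbrace{4 \times 4}_{A_1 \ldots A_4} + \underbrace{4 \times 4}_{C_1 \ldots C_4} + \underbrace{4 \times 4}_{B_1 \ldots B_4} + \underbrace{4 \times 1}_{G_1 \ldots G_4} = 48 + 4 = 52\ \text{cycles}.
\]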

Example of Hardware Data Flow


Animation

Hardware Data Flow: link to animation


Number of cycles taken: 52 × (12/4) + 28 = 184


Results

We judge the performance of the design on the basis of the following criteria:

Cycles to calculate the BBD partial solutions \(\sum B_i^*\) and \(\sum G_i^*\):
\[
52 \times (N/4) + 28
\]
Number of FP operations per cycle:
\[
4 \times \frac{(16 + 16 + 16 + 4) \times 4 + 16}{52} \simeq 17
\]
Number of FP units needed: 80 FP MAC + 32 FP division + 2 FP add
Memory bandwidth: 128 bits × 50 MHz = 800 MB/s


Results

For the hardware system operating at a 50 MHz clock, with N = 10000:
Time = 2.6 ms (a cycle-count check of this figure follows below)
For the software implementation with the same data types:
Time = 19.168 ms

Calculated speedup achieved: ≈ 7.4×
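As a sanity check, the 2.6 ms figure is consistent with the cycle-count expression from the previous slide:

\[
52 \times \frac{10000}{4} + 28 = 130\,028\ \text{cycles}, \qquad \frac{130\,028\ \text{cycles}}{50\ \text{MHz}} \approx 2.6\ \text{ms}.
\]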


Conclusion & Future Work

Conclusion
We observe a significant speedup of the hardware design compared to the software implementation
The design methodology can be scaled to larger designs and higher bandwidths
Future Work
Focus on reusing FP units so that resource usage is minimized
Do cycle-accurate testing of the integrated hardware design in simulation
Run the hardware design on the FPGA using the Bluespec emulation platform
Design the system for scalability and larger block matrices

For Further Reading

References

Mahendra Burdhak. Efficient Simulation of Large Non-Linear Circuits using Partitioning and Parallelism. DDP Phase-1 Report, 2016.

Kumar, V.B.Y., Joshi, S., Patkar, S.B., et al. FPGA Based High Performance Double-Precision Matrix Multiplication. Int J Parallel Prog (2010) 38: 322.

ML605 Hardware User Guide.
http://www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf
