Problem Formulation
Software Implementation
Hardware Implementation
Shashank Gangrade
Outline
1 Introduction
2 Problem Formulation
3 Software Implementation
External Packages
Software API
Results
4 Hardware Implementation
Memory Bottleneck
Hardware Blocks
Matrix Inversion Unit
Matrix Multiplication
Functional Block
Results
5 Conclusion & Future Work
Motivation
A_i is an m × m matrix; G_i is an m × 1 vector
B_i is an m × n matrix; X_i is an m × 1 vector
C_i is an n × m matrix; G_N is an n × 1 vector
A_N is an n × n matrix; X_N is an n × 1 vector
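Assembled from these blocks, the system AX = G has the bordered block diagonal (BBD) form below, with zero blocks left blank. The layout is inferred from the block dimensions and from the row equations that follow:

$$
\begin{bmatrix}
A_1 &        &         & B_1     \\
    & \ddots &         & \vdots  \\
    &        & A_{N-1} & B_{N-1} \\
C_1 & \cdots & C_{N-1} & A_N
\end{bmatrix}
\begin{bmatrix} X_1 \\ \vdots \\ X_{N-1} \\ X_N \end{bmatrix}
=
\begin{bmatrix} G_1 \\ \vdots \\ G_{N-1} \\ G_N \end{bmatrix}
$$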
For each block row i in the range [1, N−1], the row equation A_i X_i + B_i X_N = G_i can be written as
$$X_i = A_i^{-1}\,(G_i - B_i X_N) \qquad (1)$$
Similarly, solving for X_N uses the N-th block row:
$$C_1 X_1 + C_2 X_2 + \cdots + C_{N-1} X_{N-1} + A_N X_N = G_N$$
Substituting (1) into the N-th block row gives
$$A_N X_N = G_N - \sum_{i=1}^{N-1} C_i A_i^{-1} (G_i - B_i X_N)$$
where
$$G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i$$
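Expanding the sum with these definitions and collecting the X_N terms on the left gives a reduced n × n system in X_N alone:

$$
A_N X_N = G_N - \sum_{i=1}^{N-1} G_i^* + \Big(\sum_{i=1}^{N-1} B_i^*\Big) X_N
\;\;\Longrightarrow\;\;
\Big(A_N - \sum_{i=1}^{N-1} B_i^*\Big) X_N = G_N - \sum_{i=1}^{N-1} G_i^*
$$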
Solution of AX = G
$$X_i = A_i^{-1}(G_i - B_i X_N), \quad \forall i \in [1, N-1]$$
$$X_N = \Big(A_N - \sum_{i=1}^{N-1} B_i^*\Big)^{-1} \Big(G_N - \sum_{i=1}^{N-1} G_i^*\Big)$$
$$G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i$$
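To make the closed form concrete, here is a minimal sketch in pure Python. This is an assumption for illustration only, since the software implementation itself uses LAPACK, LAPACKE and BLAS. The names `bbd_solve`, `mat_mul`, `mat_sub` and `inverse` are hypothetical, and A_i^{-1} is formed explicitly only to mirror the formulas term by term.

```python
# Minimal pure-Python sketch of the closed-form BBD solve (no LAPACK/BLAS).
# Blocks are lists of rows; vectors are column matrices, e.g. [[1.0], [2.0]].

def mat_mul(a, b):
    """Return the matrix product a @ b."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mat_sub(a, b):
    """Return the element-wise difference a - b."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if aug[piv][col] == 0.0:
            raise ValueError("matrix is singular")
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [x / p for x in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def bbd_solve(A, B, C, G, A_N, G_N):
    """Solve a BBD system; A, B, C, G hold the blocks for i = 1..N-1."""
    inv_A = [inverse(a) for a in A]
    lhs, rhs = A_N, G_N  # become A_N - sum(B_i*) and G_N - sum(G_i*)
    for a_inv, b, c, g in zip(inv_A, B, C, G):
        c_ainv = mat_mul(c, a_inv)               # C_i A_i^-1
        lhs = mat_sub(lhs, mat_mul(c_ainv, b))   # subtract B_i*
        rhs = mat_sub(rhs, mat_mul(c_ainv, g))   # subtract G_i*
    x_n = mat_mul(inverse(lhs), rhs)
    # Back substitution with equation (1): X_i = A_i^-1 (G_i - B_i X_N)
    xs = [mat_mul(a_inv, mat_sub(g, mat_mul(b, x_n)))
          for a_inv, b, g in zip(inv_A, B, G)]
    return xs, x_n
```

A LAPACK-based version would more typically factor each A_i once and solve with the factors rather than form A_i^{-1}, which is both cheaper and numerically safer.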
Software Implementation
Software API
Diagonal block A_i is m × m
Border blocks: B_i is m × n and C_i is n × m
The matrix has N diagonal blocks
Tile size is the size of each tile matrix: TileSize = m = n
Linear Algebra Packages:
LAPACK
LAPACKE
BLAS
Parallel Programming:
OpenMP
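A dense reference matrix is handy for checking a block solver against a general-purpose solver. The helper `assemble_bbd` below is hypothetical (pure Python, not part of the described API); it places the blocks according to the API parameters listed above: N − 1 diagonal blocks A_i, border blocks B_i and C_i, a corner block A_N, and a single TileSize = m = n.

```python
def assemble_bbd(A, B, C, A_N):
    """Assemble the dense BBD matrix from its N-1 diagonal blocks A_i, right
    border blocks B_i, bottom border blocks C_i and corner block A_N.
    Assumes one tile size for every block (TileSize = m = n)."""
    tile = len(A_N)
    k = len(A)  # number of diagonal blocks before the corner, i.e. N - 1
    if len(B) != k or len(C) != k:
        raise ValueError("A, B and C must each hold N-1 blocks")
    for t in [*A, *B, *C, A_N]:
        if len(t) != tile or any(len(row) != tile for row in t):
            raise ValueError("every block must be TileSize x TileSize")
    size = (k + 1) * tile  # N * TileSize
    dense = [[0.0] * size for _ in range(size)]

    def place(block, r0, c0):
        # Copy one tile into the dense matrix with its top-left corner at (r0, c0).
        for r, row in enumerate(block):
            for c, v in enumerate(row):
                dense[r0 + r][c0 + c] = float(v)

    border = k * tile  # first row/column of the border
    for i in range(k):
        place(A[i], i * tile, i * tile)   # diagonal block
        place(B[i], i * tile, border)     # right border
        place(C[i], border, i * tile)     # bottom border
    place(A_N, border, border)            # corner block
    return dense
```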
Description of parts
Part 1: Calculate ΣG_i^* and ΣB_i^*, where
$$G_i^* = C_i A_i^{-1} G_i, \qquad B_i^* = C_i A_i^{-1} B_i$$
Part 2: Calculate X_N from ΣB_i^* and ΣG_i^*
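Because each G_i^* and B_i^* depends only on block row i, Part 1 is an independent per-block map followed by a sum, which suits a parallel loop such as OpenMP provides. The sketch below shows that structure in Python under two assumptions: a thread pool stands in for OpenMP (in CPython the GIL prevents a real speedup for this pure-Python arithmetic), and the tiles are scalars (TileSize = m = n = 1) to keep it short. `part1_block` and `solve_two_parts` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def part1_block(a, b, c, g):
    """Per-block work for TileSize = 1: return (G_i*, B_i*) = (c a^-1 g, c a^-1 b)."""
    c_ainv = c / a
    return c_ainv * g, c_ainv * b


def solve_two_parts(A, B, C, G, a_n, g_n, workers=4):
    """Solve a BBD system whose tiles are scalars (m = n = 1)."""
    # Part 1: the blocks are independent, so map them in parallel ...
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(part1_block, A, B, C, G))
    # ... then reduce the contributions to sum(G_i*) and sum(B_i*).
    g_star = sum(p[0] for p in parts)
    b_star = sum(p[1] for p in parts)
    # Part 2: calculate X_N from the two sums.
    x_n = (g_n - g_star) / (a_n - b_star)
    # Back substitution with equation (1), again independent per block.
    xs = [(g - b * x_n) / a for a, b, g in zip(A, B, G)]
    return xs, x_n
```

For example, `solve_two_parts([2, 4], [1, 1], [1, 2], [5, 11], 5, 20)` recovers X_1 = 1, X_2 = 2 and X_N = 3.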
Table: Run times for the BBD matrix solver with a large number of blocks
Hardware Implementation
Idea
Design in hardware
Memory Bottleneck
Hardware Design
Matrix Inverse
Inversion is based on Gauss-Jordan elimination
Perform a sequence of row operations on the input matrix and an identity matrix of the same size
Successive operations reduce the input matrix to the identity matrix, while the identity matrix is transformed into the inverse of the input matrix
Hardware Specifications
Inputs one tile of the matrix per cycle
For an n × n matrix, the inverse can be calculated in n² cycles
In every cycle, the unit performs either an FP division or an FP multiply-add
Each inversion unit has 8 FP MAC units and 8 FP division units
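The elimination described above can be modelled in software as follows. This is a sketch, not the hardware unit: it uses the same two operation types (a division to normalize the pivot row, then multiply-adds to clear the pivot column), and, since the slide does not mention pivoting, it assumes every pivot is nonzero, which holds for example for strictly diagonally dominant matrices. `gauss_jordan_inverse` is a hypothetical name.

```python
def gauss_jordan_inverse(a):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I].
    No pivoting: assumes every pivot encountered is nonzero."""
    n = len(a)
    aug = [[float(x) for x in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for k in range(n):
        pivot = aug[k][k]
        if pivot == 0.0:
            raise ZeroDivisionError(f"zero pivot at step {k}; pivoting would be required")
        # Division step: scale the pivot row so the pivot becomes 1.
        aug[k] = [x / pivot for x in aug[k]]
        # Multiply-add step: eliminate column k from every other row.
        for r in range(n):
            if r != k:
                f = aug[r][k]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[k])]
    # The left half is now I; the right half holds A^-1.
    return [row[n:] for row in aug]
```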
Matrix-matrix Multiplication
This is based on the rank-one update matrix multiplication algorithm [1]
Each step adds a rank-one matrix to the accumulated result matrix
Hardware Specifications
Inputs one tile of the matrix per cycle
For n × n matrices, the multiplication can be calculated in n² cycles
In every cycle, the unit performs an FP multiply-add
Each multiplication unit has 4 FP MAC units
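In the rank-one update formulation, C = AB is built as a sum over k of the outer product of column k of A with row k of B, so each update touches every element of C with exactly one multiply-add. A minimal Python model of that loop order (not a model of the pipelined hardware; `rank_one_matmul` is a hypothetical name):

```python
def rank_one_matmul(a, b):
    """Compute C = A B as a sum of rank-one updates: C += A[:, k] B[k, :] for each k."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    c = [[0.0] * cols for _ in range(rows)]
    for k in range(inner):  # one rank-one update per column of A / row of B
        col_a = [a[i][k] for i in range(rows)]
        row_b = b[k]
        for i in range(rows):
            for j in range(cols):
                c[i][j] += col_a[i] * row_b[j]  # one multiply-add per element
    return c
```

For example, `rank_one_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])` returns `[[19.0, 22.0], [43.0, 50.0]]`.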
Functional Unit
$$B^* = \sum_{i=1}^{N-1} B_i^*, \qquad G^* = \sum_{i=1}^{N-1} G_i^*$$
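Judging from these sums, the functional unit presumably combines the inversion and multiplication blocks: for each block row it needs C_i A_i^{-1}, then its products with B_i and G_i, which are added to the running sums B^* and G^*. A hypothetical streaming model of that dataflow in Python; the loop order and the `FunctionalUnitModel`, `matmul` and `gj_inverse` names are assumptions, not the actual hardware schedule.

```python
def matmul(a, b):
    """Plain matrix product a @ b (lists of rows)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gj_inverse(a):
    """Gauss-Jordan inverse without pivoting (assumes nonzero pivots)."""
    n = len(a)
    aug = [[float(x) for x in row] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for k in range(n):
        p = aug[k][k]
        aug[k] = [x / p for x in aug[k]]
        for r in range(n):
            if r != k:
                f = aug[r][k]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[k])]
    return [row[n:] for row in aug]


class FunctionalUnitModel:
    """Hypothetical software model: consumes one block row (A_i, B_i, C_i, G_i)
    at a time and keeps the running sums B* and G*."""

    def __init__(self, n):
        self.b_star = [[0.0] * n for _ in range(n)]  # B* = sum of B_i*
        self.g_star = [[0.0] for _ in range(n)]      # G* = sum of G_i*

    def consume(self, a, b, c, g):
        c_ainv = matmul(c, gj_inverse(a))            # inversion, then multiplication
        for acc, update in ((self.b_star, matmul(c_ainv, b)),
                            (self.g_star, matmul(c_ainv, g))):
            for i, row in enumerate(update):
                for j, v in enumerate(row):
                    acc[i][j] += v                    # accumulate in place
```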
Results
Results
Conclusion
We observe a sufficient speedup of the hardware implementation compared to the software implementation
The design methodology can be scaled to larger designs and higher bandwidths
Future Work
Focus on reusing FP units so that resource usage is minimized
Perform cycle-accurate testing of the integrated hardware design in simulation
Run the hardware design on an FPGA using the Bluespec emulation platform
Design the system for scalability to large block matrices
For Further Reading
References
Mahendra Burdhak. Efficient Simulation of Large Non-Linear Circuits using Partitioning and Parallelism. DDP Phase-1 Report, 2016.
Kumar, V.B.Y., Joshi, S., Patkar, S.B. et al. FPGA Based High Performance Double-Precision Matrix Multiplication. Int J Parallel Prog (2010) 38: 322.
ML605 Hardware User Guide. http://www.xilinx.com/support/documentation/boards_and_kits/ug534.pdf