
Comparison of Multimedia SIMD, GPUs and Vector Architectures

(Data Parallelism, Hennessy Section 4.4) By Harsh Prasad, 2008CS50210


CSL718 05-Apr-12

Introduction

A common way to increase parallelism among instructions is to exploit data parallelism among independent iterations of a loop. SIMD architectures can exploit significant data-level parallelism for:
matrix-oriented scientific computing
media-oriented image and sound processors

SIMD is more energy-efficient than MIMD.

SIMD Parallelism

Vector architectures
SIMD extensions
Graphics Processing Units (GPUs)

These architectures are designed to execute data-level parallel programs.
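A data-parallel loop whose iterations are all independent, such as the DAXPY kernel (Y = a*X + Y), is the canonical target for all three approaches; a minimal scalar sketch of such a loop:

```python
# DAXPY: Y = a*X + Y. Each iteration is independent of the others,
# so the loop can be executed by vector hardware, SIMD extensions,
# or GPU threads in parallel.
def daxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```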



Vector Architectures

Read sets of data elements into vector registers
Operate on those registers
Disperse the results back into memory

Example: VMIPS

Improvements:
Multiple lanes
Gather-scatter memory addressing
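Gather-scatter addressing lets a vector machine handle sparse accesses of the form A[K[i]]; a minimal Python sketch of the idea (the function names are illustrative, not the actual VMIPS mnemonics):

```python
def gather(memory, indices):
    # Indexed load: fetch memory[indices[i]] into a vector register.
    return [memory[i] for i in indices]

def scatter(memory, indices, vreg):
    # Indexed store: write vreg[i] back to memory[indices[i]].
    for i, v in zip(indices, vreg):
        memory[i] = v

mem = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
idx = [4, 0, 2]
v = gather(mem, idx)                    # [4.0, 0.0, 2.0]
scatter(mem, idx, [x * 10 for x in v])
print(mem)                              # [0.0, 1.0, 20.0, 3.0, 40.0, 5.0]
```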


Basic Structure of Vector Register Architecture (Vector MIPS)


Multi-banked memory for bandwidth and latency hiding

Pipelined Vector Functional Units

Vector Load/Store Units (LSUs)

Each vector register holds MVL elements (each 64 bits), where MVL = Maximum Vector Length.

Vector Control Registers:
VLR - Vector Length Register
VM - Vector Mask Register
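The effect of the two control registers can be sketched in Python: VLR limits how many of the MVL elements a vector instruction touches, and VM disables individual elements (a simplified model, not VMIPS semantics in full detail):

```python
MVL = 8  # maximum vector length (elements per vector register)

def vector_add(vlr, vm, va, vb):
    # Add the first `vlr` elements of va and vb; elements whose mask
    # bit is 0 are left unchanged (here, copied from va).
    out = list(va)
    for i in range(vlr):
        if vm[i]:
            out[i] = va[i] + vb[i]
    return out

va = [1, 2, 3, 4, 0, 0, 0, 0]
vb = [10, 10, 10, 10, 10, 10, 10, 10]
vm = [1, 0, 1, 0, 1, 1, 1, 1]
print(vector_add(4, vm, va, vb))  # [11, 2, 13, 4, 0, 0, 0, 0]
```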

SIMD Extensions

Media applications operate on data types narrower than the native word size.

Limitations compared to vector instructions:
The number of data operands is encoded into the opcode
No sophisticated addressing modes (strided, scatter-gather)
No mask registers
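Because the operand count is fixed by the opcode (e.g. eight 8-bit adds packed in one 64-bit word), code must be written for one specific width; a sketch of the consequence, assuming the array length is a multiple of the width:

```python
WIDTH = 4  # the SIMD width is baked into the opcode, not into a register

def simd_add(a, b):
    # One "instruction": add exactly WIDTH packed elements.
    assert len(a) == len(b) == WIDTH
    return [x + y for x, y in zip(a, b)]

def add_arrays(a, b):
    # Unlike VLR-controlled vector code, the loop must be written for
    # one fixed width; a different width would mean different opcodes.
    out = []
    for i in range(0, len(a), WIDTH):
        out += simd_add(a[i:i + WIDTH], b[i:i + WIDTH])
    return out

print(add_arrays([1, 2, 3, 4, 5, 6, 7, 8], [10] * 8))
```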


Graphics Processing Unit


Offers higher potential performance than traditional multicore computers.

Heterogeneous execution model: the CPU is the host, the GPU is the device.

NVIDIA developed a C-like programming language for the GPU and unified all forms of GPU parallelism as the CUDA (Compute Unified Device Architecture) thread.

The programming model is Single Instruction, Multiple Thread (SIMT).
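The SIMT model launches one thread per data element, organized as a grid of thread blocks; a Python sketch of the CUDA-style global index computation (blockIdx * blockDim + threadIdx), where the names mirror CUDA but the launch function itself is illustrative:

```python
def launch(kernel, n, block_dim, *args):
    # Emulate a 1-D grid: every (block, thread) pair runs the same kernel.
    grid_dim = (n + block_dim - 1) // block_dim
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx  # global element index
            if i < n:                               # guard the tail threads
                kernel(i, *args)

def daxpy_kernel(i, a, x, y):
    # Each "thread" handles exactly one element.
    y[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.0] * 5
launch(daxpy_kernel, len(x), 2, 2.0, x, y)
print(y)  # [2.0, 4.0, 6.0, 8.0, 10.0]
```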


Comparison: Vector Architectures and GPUs


A GPU has many lanes, so GPU chimes are shorter.

A vector compiler manages the mask register explicitly in software; a GPU handles masks implicitly, using branch synchronization markers and an internal stack to save, complement, and restore masks.
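A sketch of how the save/complement/restore discipline might apply to an if-then-else across SIMD lanes, assuming a simple one-level mask stack (a simplified model of the hardware mechanism, not NVIDIA's actual implementation):

```python
def run_if_else(cond, then_op, else_op, data):
    # Per-lane evaluation of the condition produces the mask.
    mask = [cond(x) for x in data]
    stack = [mask]                      # save the mask at the branch marker
    # Then-side: only lanes whose mask bit is 1 execute.
    data = [then_op(x) if m else x for x, m in zip(data, mask)]
    mask = [not m for m in stack[-1]]   # complement for the else-side
    data = [else_op(x) if m else x for x, m in zip(data, mask)]
    stack.pop()                         # restore at the reconvergence marker
    return data

# Lanes with even values are doubled, lanes with odd values are negated.
print(run_if_else(lambda x: x % 2 == 0,
                  lambda x: 2 * x,
                  lambda x: -x,
                  [1, 2, 3, 4]))        # [-1, 4, -3, 8]
```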

Vector processor and a multithreaded SIMD Processor of a GPU


Supplies scalar operands for scalar-vector operations and increments addressing for unit- and non-unit-stride accesses to memory

One PC per SIMD thread

Ensures High Memory Bandwidth



GPUs have hardware support for multithreading.

A VMIPS register holds the entire vector; in a GPU, the vector is spread across the registers of the SIMD lanes.


In a vector architecture, memory latency is hidden by paying the latency once per vector load/store instruction; a GPU hides it using multithreading. The conditional branch mechanism of the GPU handles the strip-mining problem of vector architectures by iterating the loop until all the SIMD lanes reach the loop bound.
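Strip-mining on a vector machine splits an n-element loop into pieces of at most MVL elements, setting VLR to the odd-sized first piece; a sketch of what a vector compiler generates:

```python
MVL = 64  # maximum vector length of the machine

def strip_mined_sum(a, b):
    # Process n elements in strips of at most MVL, the way a vector
    # compiler would: the first strip handles the n mod MVL leftover.
    n = len(a)
    out = [0] * n
    start = 0
    vlr = n % MVL or min(n, MVL)         # VLR for the first strip
    while start < n:
        for i in range(start, start + vlr):  # one vector instruction
            out[i] = a[i] + b[i]
        start += vlr
        vlr = min(MVL, n - start)        # full-length strips afterwards
    return out
```

For n = 130 and MVL = 64 this executes strips of 2, 64, and 64 elements.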


Comparison: Multimedia SIMD Computers and GPUs


Unlike multimedia SIMD extensions, where scalar and SIMD instructions execute on the same processor and share one memory, in GPUs the scalar (host) processor and the SIMD processors are separated by an I/O bus and have separate main memories.


Also, multimedia SIMD instructions do not support scatter-gather memory accesses. In short, GPUs are multithreaded SIMD processors with more lanes, more SIMD processors, and better hardware support for multithreading.


Thank You

