
CS4007/SCS4007 High Performance Computing

Malik Silva (mks@ucsc.cmb.ac.lk) June 3, 2013

High Performance Computing


Lecture 1: Introduction
Objectives: What is the performance divide? What is an embarrassingly parallel computation? Outline the basic serial performance improvement guidelines. What is common sub-expression elimination, strength reduction, loop invariant code motion, constant value propagation and evaluation and dependency elimination?

Part 1:
1. Definition: A simple definition of HPC is high-speed computing. It generally refers to the practice of aggregating computing power in a way that delivers much higher performance than one could get out of a typical desktop computer or workstation, in order to solve large problems in science, engineering, or business.

2. Benefit of HPC: The main benefit of HPC is quicker processing. This allows, for example, automakers to reduce the time required to develop new vehicle platforms from an average of 60 months to 24 months, while greatly improving crashworthiness, environmental friendliness, and passenger comfort. HPC is also extensively used for weather forecasts. Hospitals can use HPC to predict which expectant mothers will require surgery for Caesarean births, with the goal of avoiding traditional, riskier last-minute decisions during childbirth. HPC also helps businesses of all sizes, particularly for transaction processing and data warehouses. Doing computations faster is of use to almost everybody.

3. Performance divide: Given a computer, there is usually a gap between the peak performance that the computer is capable of and the actual performance obtained by users for their applications. The main reason for this mismatch is poorly designed user application programs that do not match the hardware of the computer they run on. Reducing this gap for our programs is one of the goals of this course.

4. Serial and parallel programs: A serial (sequential) program has a single process of execution, while a parallel program has multiple processes of execution.

5. This course: The objective of this course is to give the skills to develop serial and parallel programs with good performance. The course will involve the following lectures:

Lecture 1: Part 1: Performance Divide, Serial and Parallel Programs, This Course; Part 2: Basic Serial Performance Optimization Guidelines; Part 3: Guideline: Use of Compiler Optimizations
Lecture 2: Guideline: Data Locality Improvement
Lecture 3: Part 1: Guideline: Use of a Good Computing Environment; Part 2: Guideline: Selection of a Good Algorithm
Lecture 4: Hardware for Parallel Computing
Lecture 5: Parallel Program Design
Lecture 6: Basic MPI
Lecture 7: Further MPI
Lecture 8: In-Class Test
Lecture 9: Debugging MPI Codes + Demos
Lecture 10: Part 1: Shared Memory Programming Using Pthreads; Part 2: Problem of Cache Coherence; Part 3: False Sharing

Lecture 11: State Diagram Use
Lecture 12: Basic OpenMP
Lecture 13: Further OpenMP
Lecture 14: Assignment 2 Presentations
Lecture 15: Part 1: Selected Topic for 2013; Part 2: Summary of Course + Exam Introduction

The evaluation will be through the two-hour final exam (60%) and the continuous assessment (40%), which involves two assignments [one individual, one group] and a weekly individual quiz that will test some content from the previous week's lecture.

Part 2: Basic Serial Performance Optimization Guidelines


Selection of the best algorithm: For large data sets, the advantages of using a low-complexity algorithm (with a lower operation count) will be large.
Use of efficient libraries: e.g., BLAS (Basic Linear Algebra Subroutines) (www.netlib.org). A good-quality public-domain library: LAPACK. (See the sketch after this list.)
Optimal data locality: It is important that the faster, higher levels of the memory hierarchy are used efficiently.
Use of compiler optimizations: e.g., cc -O3 myprogram.c. Using a medium optimization level (-O2 on many machines) typically leads to a speedup by a factor of 2 to 3 without a significant increase in compilation time.
Use of a better computing environment: e.g., a more powerful system with good systems software.
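As an illustration of the "efficient libraries" guideline, the small sketch below calls the BLAS dot-product routine through the CBLAS interface instead of hand-coding the loop. It assumes a CBLAS installation that provides cblas.h (e.g., Netlib CBLAS, ATLAS or OpenBLAS), which is not part of these notes; the library name used at link time varies between distributions.

#include <stdio.h>
#include <cblas.h>   /* CBLAS interface to BLAS (assumed to be installed) */

int main(void)
{
    double x[4] = {1.0, 2.0, 3.0, 4.0};
    double y[4] = {1.0, 1.0, 1.0, 1.0};

    /* ddot computes the dot product of two double-precision vectors.
       Arguments: length, vector x, stride of x, vector y, stride of y. */
    double result = cblas_ddot(4, x, 1, y, 1);

    printf("dot product = %f\n", result);   /* prints 10.000000 */
    return 0;
}

Compile with something like cc prog.c -lcblas (or -lopenblas, depending on the BLAS distribution installed).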

Part 3: Guideline: Use of Compiler Optimizations


Compilers do optimizations on units called basic blocks. A basic block is a segment of code that does not contain branches.

Common subexpression elimination: If a subexpression is common to two or more expressions, the compiler will pull it out and precalculate it. e.g.,

s1 = a + b + c
s2 = a + b - c

will be replaced by:

t = a + b
s1 = t + c
s2 = t - c

Strength reduction: Replacement of an arithmetic expression by another equivalent expression which can be evaluated faster. e.g., replacement of 2*i by i + i, since integer additions are faster than integer multiplications.

Loop invariant code motion: e.g., the compiler will replace:

do i=1,n
  a(i) = r*s*a(i)
enddo

with:

t1=r*s
do i=1,n
  a(i) = t1*a(i)
enddo
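A minimal C sketch of the same three transformations, written out by hand (the function and variable names are made up for illustration; an optimizing compiler would normally perform these automatically):

/* Common subexpression elimination: a + b is computed only once. */
static void cse(double a, double b, double c, double *s1, double *s2)
{
    double t = a + b;
    *s1 = t + c;
    *s2 = t - c;
}

/* Strength reduction: 2*i replaced by the cheaper i + i. */
static int twice(int i)
{
    return i + i;
}

/* Loop-invariant code motion: r*s is hoisted out of the loop. */
static void scale(double *a, int n, double r, double s)
{
    double t1 = r * s;
    for (int i = 0; i < n; i++)
        a[i] = t1 * a[i];
}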

Constant value propagation and evaluation: Whenever the compiler can determine that a variable takes on a constant value, it will replace all occurrences of this variable by that constant value. Also, when an expression involves several constant values, they will be evaluated and replaced by a new constant value. e.g.,

two = 2;
x = 3*two*y;

will be replaced by the compiler by:

x = 6*y;

Dependency elimination: There are three types of dependencies.

True dependence:
t3 = t4 + t5
t6 = t7 - t3

Anti-dependence:
t3 = t4 + t5
t4 = t6 - t7

Output dependence:
t3 = t4 + t5
...
t3 = t6 - t7

Dependent instructions cannot be executed simultaneously. The second dependent instruction has to wait, usually a few cycles, until the first one is finished. The compiler attempts to order dependent instructions in such a way that they are separated by a few cycles or, if possible, to eliminate them. Consider the following true dependence:

x = 1.0 * 2.0 * y;
s = s + x;
z = 2.0 * 3.0 * r;
t = t + z;

This can be improved by:

x = 1.0 * 2.0 * y;
z = 2.0 * 3.0 * r;
s = s + x;
t = t + z;

In the case of the anti-dependence, the compiler can write the second result into t2. In the case of the output dependence, the compiler can use t2 for one of the two results instead of using t3 for both.

High Performance Computing


28/1/2013; Lecture 2: Guideline: Data Locality Improvement
Objectives: Appreciate the importance of data locality. What is meant by the memory hierarchy? What is a cache line? Outline the difference between direct-mapped, set-associative and fully-associative caches. Define compulsory, conflict and capacity cache misses. What is meant by write-through and write-back caches? Define spatial and temporal locality. Use fusion, loop interchange, blocking and data laying to improve data locality. Define prefetching and give the requirements for good prefetching. Use padding and array merging to avoid cache thrashing.

6. Memory wall problem: Processors are very powerful and are capable of high performance. However, this high performance is useless if the processors have to wait for data to be brought in from distant places. It is like a very efficient worker (the processor) idling while the data for him to process are being brought to him from distant places (memory, hard disk) (Figure 1). It is not processing data, but moving data, that is costly. The gap between processor and memory speeds continues to widen (Figure 2). It is a major hindrance to performance, and the gap is projected to increase in the future too. Thus it is important to overcome this.

7. Definitions:

Memory hierarchy: A hierarchy of memory is provided: L1 cache, L2 cache, memory, hard disk, etc. Fast cache is at the top of the hierarchy; access times increase down the hierarchy. Typical access times:
Cache: 10-20 ns
Memory: 90-120 ns
Hard disk: 10,000,000 to 20,000,000 ns

Figure 1: The memory wall problem

Figure 2: The processor and memory access speed gap

Fast caches are expensive, and are thus small. An example of memory hierarchy usage is shown in Figures 3 and 4.

Figure 3: The memory access example. The processor wants to work with the variable total.

Figure 4: The memory access example. The value of the variable total is brought to the L2 and L1 caches.

There could be both an instruction cache and a data cache, or there could be a unified cache in which both instructions and data are stored.

Cache line: The smallest unit of data that can be transferred to or from memory (it is the data that fits the suitcase in Figure 1). It holds the contents of a contiguous block of main memory. Cache lines are usually between 32 and 128 bytes. If the data requested by the processor are found in the cache, it is a cache hit; if not, it is a cache miss. On a cache miss, the new data must be brought into the cache.

Cache associativity: Specifies the number of different cache slots in which a cache line can reside. There are three types (a small code sketch of these mappings is given after this block of definitions):
Direct-mapped: If the size of the cache is n words, then the i-th word in memory can be stored only in the position given by mod(i, n).
Set-associative: In an N-way set-associative cache, the cache is broken into sets, where each set contains N cache lines. Each memory address is assigned a set, and the relevant cache line can be cached in any one of the slots within that set.
Fully-associative: A cache line can be placed in any free slot in the cache.

Cache misses:
Compulsory miss: Occurs when the cache line has to be brought into the cache when first accessing it.
Capacity miss: A miss that is not a compulsory miss but would still occur in a fully associative cache.
Conflict miss: A miss that is neither a compulsory miss nor one that would occur in a fully associative cache; it occurs only in a set-associative or direct-mapped cache, due to cache accesses which conflict with each other.

Cache line replacement: Not a big issue with direct-mapped caches, but for set-associative and fully-associative caches some replacement mechanism must be used. Ways of replacement: least-recently used, random, round-robin.

Write-back and write-through caches: When a processor performs a store instruction, it typically writes the data into the cache-resident line containing the address. There are two policies for dealing with the cache line subsequent to its having data written to it.
Write-through: Data is written both to the cache line in the cache and to main memory.
Write-back: Data is written only to the cache line in the cache. This modified cache line is written to memory only when necessary (usually when it is replaced in the cache by another cache line).

Data locality: The memory hierarchy can be a solution to the memory wall problem if data are cache resident. One way to improve cache residence is to improve the data locality in the program. There are two types of data locality: spatial and temporal. Consider variables A, B, C, D, E, ...
Spatial locality: nearby memory locations are accessed (e.g., A B C ...).
Temporal locality: multiple accesses to the same memory location (e.g., A A A ...).
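The following is a small sketch of how a byte address maps to a cache slot under the direct-mapped and set-associative schemes. The cache parameters used here (32-byte lines, 16 kB capacity, 4-way sets) are assumed for illustration only, not taken from any particular machine.

#include <stdio.h>

#define LINE_SIZE  32u                       /* bytes per cache line (assumed) */
#define CACHE_SIZE (16u * 1024u)             /* total cache capacity (assumed) */
#define NUM_LINES  (CACHE_SIZE / LINE_SIZE)  /* 512 line slots in total */
#define ASSOC      4u                        /* ways per set in the set-associative case */
#define NUM_SETS   (NUM_LINES / ASSOC)       /* 128 sets */

int main(void)
{
    unsigned int addr = 0x12345u;            /* an arbitrary byte address */
    unsigned int block = addr / LINE_SIZE;   /* which memory block the address lies in */

    /* Direct-mapped: the block can live in exactly one slot. */
    unsigned int dm_slot = block % NUM_LINES;

    /* Set-associative: the block is assigned a set and may occupy any of ASSOC ways. */
    unsigned int set = block % NUM_SETS;

    printf("address 0x%x -> block %u, direct-mapped slot %u, set %u (any of %u ways)\n",
           addr, block, dm_slot, set, ASSOC);
    return 0;
}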

8. Improving Data Locality

Fusion: When the same data is used in separate sections of the code, bring those sections close together if possible. This improves temporal locality. e.g. 1,

Before:
...
...A...
...
...
...
...A...
...

After:
...
...A...
...A...
...
...

e.g. 2,

Before:
Matrix A usage
Matrix B usage
Matrix A usage
Matrix B usage

After:
Matrix A usage
Matrix A usage
Matrix B usage
Matrix B usage

Prefetching: Moving data up the memory hierarchy before they are actually needed. See Figure 5. Good data prefetching has: little overhead; correctness (the prefetched data should actually be used in the future); no pollution (it should not evict from the cache the data that is presently being used); just-in-time arrival (neither too early, as the data may be evicted before use, nor too late).

Figure 5: Prefetching example. The processor is presently working with total. The variable pressure, having the value 9.1, is a future usage. However, it is also prefetched into the cache hierarchy.

Loop Interchange: Reorder loops to align the access pattern in the loop with the pattern of data storage in memory. e.g., in C:

for (i=0;i<4;i++)
  for (j=0;j<4;j++)
    a[j][i] = ...

After loop interchange:

for (j=0;j<4;j++)
  for (i=0;i<4;i++)
    a[j][i] = ...

Blocking: Instead of operating on entire rows or columns of a matrix, which may be too big to fit in cache, operate on sub-matrices or data blocks (tiles). This improves temporal locality.

Standard matrix multiplication:

for (i=0; i<N; i++)
  for (j=0; j<N; j++) {
    c[i][j] = 0.0;
    for (k=0; k<N; k++)
      c[i][j] = c[i][j] + a[i][k]*b[k][j];   /* Makes C row by row */
  }

After blocking:

for (p=0; p<NB; p++){
  for (q=0; q<NB; q++) {
    for (r=0; r<NB; r++) {
      for (i=p*NEIB;i<p*NEIB+NEIB;i++)
        for (j=q*NEIB;j<q*NEIB+NEIB;j++)
          for (k=r*NEIB;k<r*NEIB+NEIB;k++)
            c[i][j]=c[i][j] + a[i][k]*b[k][j];   /* makes C, block-row by block-row */
    }
  }
}

Note: NB: number of blocks; NEIB: number of elements in a block.

Good Data Layouts: In general, data is laid out in memory according to the data definitions in the program, which may not be the way the data is actually accessed. Data laying: lay the data out in memory according to the access pattern. This improves spatial locality. e.g., consider a 4x4 matrix multiplication AB = C. Row-major storage for matrices A, B, C (each entry shows the memory position of that matrix element):

A:  1  2  3  4     B: 17 18 19 20     C: 33 34 35 36
    5  6  7  8        21 22 23 24        37 38 39 40
    9 10 11 12        25 26 27 28        41 42 43 44
   13 14 15 16        29 30 31 32        45 46 47 48

A better layout for a 2x2 blocked multiplication (each entry again shows the memory position of that matrix element; each 2x2 block now occupies consecutive memory locations):

A:  1  2  5  6     B: 17 19 25 27     C: 33 34 37 38
    3  4  7  8        18 20 26 28        35 36 39 40
    9 10 13 14        21 23 29 31        41 42 45 46
   11 12 15 16        22 24 30 32        43 44 47 48

Example code fragment:

/* Data laid code fragment for matrix multiplication */
#include <stdio.h>
#include <assert.h>

#define N 2048              /* Matrix size */
#define NB 2                /* No. of blocks across the matrix and also down the matrix */
#define NEIB N/NB           /* No. of elements in a direction */
#define BLOCKSIZE NEIB*NEIB /* No. of elements in a block */

float a[N*N];
float b[N*N];
float c[N*N];

int main(int argc, char *argv[])
{
  int i,j,k;
  int p,q,r;
  int istart, jstart, jmainstart, kstart, m, n, o, ilimit;

  /* generate mxs */
  for (i=0; i<N*N; i++) {
    a[i] = 1.0;
    b[i] = 2.0;
    c[i] = 0.0;
  }

  assert(N%NB==0);

  for (p=0; p<NB; p++){
    istart = p*BLOCKSIZE*NB;
    jmainstart = istart;
    jstart = jmainstart;
    j = jstart;
    for (q=0; q<NB; q++) {
      kstart = q*BLOCKSIZE*NB;
      k = kstart;
      for (r=0; r<NB; r++) {   /* within an r the update of a block is done */
        ilimit = istart + BLOCKSIZE;
        for (i=istart;i<ilimit;i++) {
          for (m=0;m<NEIB;m++) {
            c[i]=c[i] + a[j]*b[k];
            j++;
            k++;
          } /* m loop end */
          o = (i+1) % NEIB;
          if (o == 0) {
            if ((i+1)<ilimit) {
              k = kstart;
              jstart = jstart + NEIB;
              j = jstart;
            }
          }
          else
            j = jstart;
        } /* i loop end */
        jstart = j;
        kstart = k;
      } /* r loop end */
      istart = i;
      jstart = jmainstart;
      j = jstart;
    } /* q loop end */
  } /* p loop end */

  /* Two results */
  printf("%.3f\t",c[0]);
  printf("%.3f\n",c[N*N-1]);
}

9. Reducing cache conflicts (these occur in direct-mapped or set-associative caches)

Sometimes, a single instruction or adjacent instructions operate on data separated in the address space by a distance equal to some integer multiple of the cache size. In direct-mapped caches, such data entries will fall on the same cache location (slot), creating a conflict and impeding performance. Set-associative caches are designed to reduce this problem, but conflicts still occur if many data items (exceeding the set associativity of the cache) compete for the same location. Consider the following example:

#define N 2048
double y[N], x[N];
...
for (i=0;i<N;i++)
  y[i] = x[i] + y[i];

Assume that we are using a 16 kB direct-mapped L1 cache. Then the amount of storage that an array (either x or y) requires = (2048*8)/1024 kB = 16 kB, i.e., the two data entries used in each iteration of the above program are separated by 16 kB (the size of our L1 cache). Also assume that the cache line size of our machine is 32 bytes (and thus holds 4 elements of an array).

Now let us see how this program performs. Since x[i] and y[i] map to the same cache slot, every load is a cache miss (marked below):

Load x[0] (x[0] is loaded into a register) -- miss
Load y[0] (y[0] is loaded into a register) -- miss
Add x[0] to y[0]
Store y[0]
Load x[1] -- miss
Load y[1] -- miss
Add x[1] to y[1]
Store y[1]

Thus we see conflict (and compulsory) misses in it. Cache conflicts can be avoided through padding. This is based on the observation that when N is either greater than 2051 or less than 2045, the two data items would fall on different cache slots, i.e., we do, for example:

#define N 2052

Now let us see the performance of the same program (only the first access to each cache line misses, and these are compulsory misses):

Load x[0] -- miss
Load y[0] -- miss
Add x[0] to y[0]
Store y[0]
Load x[1]
Load y[1]
Add x[1] to y[1]
Store y[1]

Thus we see a reduction of misses. Compilers can pad arrays to avoid cache conflicts.

Another technique to reduce conflict misses is array merging. Referencing multiple arrays with the same indices may lead to conflict misses for some array sizes. The remedy is to merge the independent arrays into a single array. As an example of the use of this technique, consider the following loop (Fortran):

For I=1,n do
  C(I) = A(I)+B(I)
Endfor

In the extreme case when A(I), B(I) and C(I) each map to the same cache slot, this slot will have to be constantly emptied and refilled, leading to cache thrashing. A fix for this problem is to merge A, B and C into a single array, R. Then,

For I=1,n do
  R(3,I) = R(1,I)+R(2,I)
Endfor

When R is stored in column-major order (as in Fortran), this technique drastically reduces the number of cache misses. (A C sketch of the same idea follows the quiz list below.)

Quiz series for Assignment 2:
Quiz 1. Give a short introduction to BLAS (Basic Linear Algebra Subroutines) [from Lecture 1].
Quiz 2. Give a presentation to explain how blocking improves temporal locality.
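A minimal C sketch of the array-merging idea above (the struct name and field names are made up for illustration): instead of three separate arrays whose i-th elements may map to the same cache slot, the three values for index i are placed side by side in one structure, so they share a cache line instead of competing for one.

#include <stdio.h>

#define N 2048

/* Merged layout: a[i], b[i] and c[i] are adjacent in memory. */
struct merged {
    double a;
    double b;
    double c;
};

static struct merged r[N];

int main(void)
{
    for (int i = 0; i < N; i++) {
        r[i].a = 1.0;
        r[i].b = 2.0;
    }
    for (int i = 0; i < N; i++)
        r[i].c = r[i].a + r[i].b;   /* all three operands come from the same cache line */

    printf("%f\n", r[0].c);
    return 0;
}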

High Performance Computing


11/2/2013; Lecture 3
Objectives: Outline how one could create a good computing environment. Explain loop interchange. Explain loop fusion. Explain loop fission. Explain loop peeling. Explain loop unrolling. Given a code fragment (in C), write an optimized version of it.

Part 1: Guideline: Use of a Good Computing Environment


10. There are many ways to optimize the performance of your computer. The following are some of them:

Increase the memory in the computer.

Speed up access to data through disk defragmentation: Disk fragmentation slows the overall performance of a system. When files are fragmented, the computer must search the hard disk as a file is opened (to piece it back together), and the response time can be significantly longer. Disk defragmentation consolidates fragmented files and folders on a computer's hard disk so that each occupies a single space on the disk.

Use Linux (?): This recommendation is controversial, but working in a Linux environment will free you of a lot of worries (e.g., viruses).

Free up disk space: Free up your disk by removing unwanted files.

Have a good backup scheme.


Part 2: Guideline: Selection of a Good Algorithm


11. Loop Optimizations

Loop Interchange: Loops are reordered to minimize the stride (see the note below) and align the access pattern in the loop with the pattern of data storage in memory.

Loop Fusion: Adjacent or closely located loops are fused into one single loop.

void nofusion()
{
  int i;
  for (i=0;i<nodes;i++) {
    a[i] = a[i] * small;
    c[i] = (a[i]+b[i])*relaxn;
  }
  for (i=1;i<nodes-1;i++) {
    d[i] = c[i] - a[i];
  }
}

void fusion()
{
  int i;
  a[0] = a[0]*small;
  c[0] = (a[0]+b[0])*relaxn;
  a[nodes-1] = a[nodes-1]*small;
  c[nodes-1] = (a[nodes-1]+b[nodes-1])*relaxn;
  for (i=1;i<nodes-1;i++) {
    a[i] = a[i] * small;
    c[i] = (a[i]+b[i])*relaxn;
    d[i] = c[i] - a[i];
  }
}

Note: Stride = the increment in memory address between successive elements addressed.


Loop Fission: Split the original loop if it is worthwhile. e.g. 1,

void nofission()
{
  int i, a[100], b[100];
  for (i=0;i<100;i++) {
    a[i]=1;
    b[i]=2;
  }
}

void fission()
{
  int i, a[100], b[100];
  for (i=0;i<100;i++) {
    a[i]=1;
  }
  for (i=0;i<100;i++) {
    b[i]=2;
  }
}

The goal in the above example is to break down a large loop body into smaller ones to achieve better utilization of locality of reference.

Loop Peeling: Peel off the edge iterations of the loop.

Before peeling:

for (i=1;i<=N;i++) {
  if (i==1)
    x[i]=0;
  else if (i==N)
    x[i]=N;
  else
    x[i]=x[i]+y[i];
}


After peeling:

x[1]=0;
for (i=2;i<N;i++)
  x[i]=x[i]+y[i];
x[N]=N;

Loop Unrolling: Reduces the effect of branches. e.g.,

Before unrolling:

do i=1,N
  y[i] = x[i]
enddo

After unrolling by a factor of four:

nend = 4*(N/4)
do i=1,nend,4
  y[i] = x[i]
  y[i+1] = x[i+1]
  y[i+2] = x[i+2]
  y[i+3] = x[i+3]
enddo
do i=nend+1,N
  y[i] = x[i]
enddo

(A C version of the same unrolling pattern is sketched after the quizzes below.)

Quiz series for Assignment 2:
Quiz 3. Give a presentation on how conflict misses can occur in set-associative caches.
Quiz 4. Give a presentation on how one could improve the performance of his computer (e.g., through increasing memory etc.).
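A minimal C sketch of the same unroll-by-four pattern (the function name, array names and types are illustrative only):

/* Copy x into y, unrolled by a factor of four.
   The first loop handles the part divisible by four;
   the clean-up loop handles the remaining 0-3 elements. */
void copy_unrolled(double *y, const double *x, int n)
{
    int nend = 4 * (n / 4);
    int i;
    for (i = 0; i < nend; i += 4) {
        y[i]     = x[i];
        y[i + 1] = x[i + 1];
        y[i + 2] = x[i + 2];
        y[i + 3] = x[i + 3];
    }
    for (; i < n; i++)
        y[i] = x[i];
}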


12. There are formal techniques to analyze the efficiency of algorithms in terms of time and space efficiency. (Quiz 5)

Quiz series for Assignment 2:
Quiz 5. Give a presentation on the use of the big-O notation to analyze algorithms. (Chapter 2 of the book Fundamentals of Algorithmics by Gilles Brassard and Paul Bratley [available in the Library].)

High Performance Computing


Lecture 4; 18/2/2013: Hardware for Parallel Computing
Objectives: Describe the effect of a fork. Define the speedup of a parallel computation. Derive Amdahl's law. Define multi-processor and multi-core computers. Explain the concept of hyperthreading. Define types of distributed computing.

13. Part 1: Introduction

Process: An instance of an executing program. Present computers can handle several processes simultaneously. Each Linux process is given a unique process id. e.g.,

% ps
 PID  TIME  COMMAND
2345  0.12  inventory
2346  0.01  payroll

fork(): Creates a new process by duplicating the calling process. e.g.,

#include <stdio.h>
#include <unistd.h>

int main(void)
{
  int x;
  printf("Just one process so far\n");
  printf("calling fork...\n");
  x = fork();   /* create new process */
  if (x==0)
    printf("I am the child\n");
  else if (x>0)
    printf("I am the parent. Child has pid:%d\n",x);
  else
    printf("Fork returned error code; no child\n");
}

The new process runs a copy of the same program as its creator, with the variables within it having the same values as those within the calling process. The address space of the child is a replica of its parent. The process that called fork() is the parent; the newly created process is the child. The parent resumes execution, and the child starts execution at the same place, where the call returns. Return values of fork(): for the parent, the process id of the child (or a negative value on error); for the child, 0.

Note: Never get the spellings of these two wrong! Processes: tasks. Processors: hardware central processing units.
Speedup (S):

S(n) = (execution time using one processor, i.e., the best possible serial time) / (execution time on a parallel computer with n processors) = ts / tp

e.g., if the serial time of a computation is 4 minutes and the time taken on a parallel computer is 2 minutes, then the achieved speedup is 4/2 = 2.

Efficiency (E) = S / n

Factors that limit speedup:
Periods when not all the processors can be performing useful work and are idle (including when only one processor is active on inherently serial parts of the computation).
Extra computations in the parallel version not appearing in the serial version.
Time for communication/synchronization.

Maximum speedup? If the fraction of the computation that cannot be divided into concurrent tasks is f, and no overhead is incurred when the computation is divided into concurrent parts, the time to perform the computation with n processors is:

f*ts + (1-f)*ts/n

Hence the speedup is:

S(n) = ts / (f*ts + (1-f)*ts/n) = n / (1 + (n-1)f)

(Amdahl's Law)
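A short worked example of the formula above (the numbers are chosen only for illustration): if 5% of a computation is inherently serial (f = 0.05) and n = 20 processors are used, then

S(20) = 20 / (1 + 19 * 0.05) = 20 / 1.95 = approximately 10.3,

so even with 20 processors the speedup is only about 10, and as n grows the speedup can never exceed 1/f = 20.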

Granularity: The size of the computation between communication and synchronization points. (Large computation size: coarse granularity; small computation size: fine granularity.)

14. Part 2: Hardware

(a) Multi-tasking Single Processor Computers: Multiple processes can be run in a time-shared manner on a general single-processor computer. Here, several processes reside in memory with the CPU being allocated to each in turn for a period, giving the user the illusion that all run together. Note: It is difficult to get a speed-up unless, while one process is doing computing, the other processes are involved in input/output, for example.

(b) Shared Memory Computers

Multiprocessor Configurations (usually called SMPs [Symmetric Multiprocessors]): This involves a multiprocessor computer hardware architecture where two or more identical processors are connected to a single shared memory and are controlled by a single OS instance. Figure 6 shows a dual processor system containing two separate physical computer processors in the same chassis. The two processors are usually located on the same circuit board (motherboard).

Figure 6: Dual Processor System (Source: www.ddj.com)

In an SMP, multiple identical processors share memory connected via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors.

Hyperthreading (Simultaneous Multithreading [SMT]): With HT Technology, two threads can execute on the same single processor core simultaneously, in parallel, rather than context switching between the threads. Scheduling two threads on the same physical processor core allows better use of the processor's resources.
(Context switching: the term given to indicate OS-initiated process switching.)

HT Technology adds circuitry and functionality into a traditional processor to enable one physical processor to appear as two separate processors. Each processor is then referred to as a logical processor. The added circuitry enables the processor to maintain two separate architectural states and separate Advanced Programmable Interrupt Controllers (APIC) which provides multi-processor interrupt management and incorporates both static and dynamic symmetric interrupt distribution across all processors. Hyper-threading duplicates about 5% of the cpu circuitry. The shared resources include items such as cache, registers, and execution units to execute two separate programs or two threads simultaneously. This means that the execution unit is time-shared by both threads concurrently and the execution unit continuously makes progress on both threads. Results in a more fully utilized CPU. All main processor makers now provide some form of multithreading. Although hyperthreading technology is implemented on a single CPU, the OS recognizes two logical processors and schedules tasks to each logical processor. (can attain about 30% performance improvements for a variety of codes). A CPU with hyper-threading has two sets of the circuits which keep track of the state of the cpu. This includes most of the registers and instruction pointer. These circuits do not accomplish the actual work of the CPU, they are temporary storage facilities where the CPU keeps track of what it is currently working on. The vast majority of the CPU remains unchanged. The portions of the CPU which do computational work are not replicated and nor are the L1, L2 caches. Requirements to enable HT Technology are a system equipped with a processor with HT Technology, an OS that supports HT Technology and BIOS support to enable/disable HT Technology.

Figure 7: Processor equipped with Hyper-Threading Technology (Source: www.ddj.com)

Note that it is also possible to have a dual processor system that contains two HT Technology enabled processors, which would provide the ability to run up to 4 programs or threads simultaneously.

Dual Core

This term refers to integrated circuit (IC) chips that contain two complete physical computer processors (cores) in the same IC package. Typically, this means that two identical processors are manufactured so that they reside side by side on the same die. It is also possible to (vertically) stack two separate processor dies and place them in the same IC package. Each of the physical processor cores has its own resources (architectural state, registers, execution units, etc.). The multiple cores on the die may or may not share several layers of the on-die cache. A dual core processor design could provide for each physical processor to: 1) have its own on-die cache, or 2) share the on-die cache between the two processors, or 3) give each processor a portion of on-die cache that is exclusive to it, plus a portion of on-die cache that is shared between the two cores.

Figure 8: Dual Core System (Source: www.ddj.com)

Note that dual core processors could also contain HT Technology, which would enable a single processor IC package, containing two physical processors, to appear as four logical processors capable of running four programs or threads simultaneously. e.g., of dual core processors: AMD Phenom II X2, Intel Core Duo.

Multi Core: A multi core system is an extension of the dual core system, except that it consists of more than 2 processors. The current trends in processor technology indicate that the number of processor cores in one IC chip will continue to increase. If we assume that the number of transistors per processor core remains relatively fixed, it is reasonable to assume that the number of processor cores could follow Moore's Law, which states that the number of transistors per a certain area on the chip will double approximately every 18 months. Even if this trend does not follow Moore's Law, the number of processor cores per chip appears destined to steadily increase, based on statements from several processor manufacturers. The optimal number of processors is yet to be determined, but will probably change over time as software adapts to effectively use many processors simultaneously. However,

a software program that is only capable of running on one processor (or very few processors) will be unable to take full advantage of future processors that contain many processor cores. For example, an application running on a 4-processor system with each socket containing quad-core processors has 16 processor cores available to schedule 16 program threads simultaneously. e.g.: a quad-core processor contains four cores (e.g. AMD Phenom II X4, the Intel 2010 core line that includes three levels of quad-core processors, namely i3, i5, and i7), a hexa-core processor contains six cores (e.g. AMD Phenom II X6, Intel Core i7 Extreme Edition 980X), an octa-core processor contains eight cores (e.g. AMD FX-8150).

Many Core: A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multi-processor techniques are no longer efficient, largely because of issues with congestion in supplying instructions and data to the many processors. The many-core threshold is roughly in the range of several tens or hundreds of cores.

Graphics Processing Units (GPUs): Graphics cards (often having 100+ processor cores) and the rich structure of memory that they can share form a good general-purpose computing platform. Extensions to C that allow one to access the computing capabilities on these cards, called CUDA, have been created. Each processor can do less than your CPU, but with their powers combined they become a fast parallel computer.

(c) Distributed Computing

A distributed computer (also known as a distributed memory multiprocessor or multicomputer) is a distributed memory computer system in which the processing elements are connected by a network.

Cluster computing: A cluster is a group of loosely coupled computers that work together closely, so that in some respects they can be regarded as a single computer. Clusters are composed of multiple standalone machines connected by a network. While machines in a cluster do not have to be symmetric, load balancing is more difficult if they are not. The most common type of cluster is the Beowulf cluster, which is a cluster implemented on multiple identical commercial off-the-shelf computers connected with a TCP/IP Ethernet local area network. (The vast majority of the TOP500 supercomputers are clusters.)

Massively parallel processing: A massively parallel processor (MPP) is a single computer with many networked processors. MPPs have many of the same characteristics as clusters, but MPPs have specialized interconnect networks (whereas clusters use commodity hardware for networking). MPPs also tend to be larger than clusters, typically having far more than 100 processors. In an MPP, each CPU contains its own memory and copy of the operating system and application. Each subsystem communicates with the others via a high-speed interconnect.

Grid computing: Grid computing is the most distributed form of parallel computing. It makes use of computers communicating over the Internet to work on a given problem.

Because of the low bandwidth and extremely high latency available on the Internet, grid computing typically deals only with embarrassingly parallel problems. Most grid computing applications use middleware, software that sits between the operating system and the application to manage network resources and standardize the software interface. The middleware keeps track of site configurations (OS, hardware, installed applications) and of the current load situation (available CPUs, queue lengths, free disk space). It sends jobs to a suitable site, submits them to the local batch system, and handles stage-in and stage-out of files. Often, grid computing software makes use of spare cycles, performing computations at times when a computer is idling.

Motivation: Most CPU power is unused most of the time (possibly as much as 90%). Some problems are too big for a single machine; some jobs take too long on one machine. Grid computing is directed towards embarrassingly obvious parallelism. It offers high throughput, not high performance. High throughput is the term used to describe applications with massive amounts of data that typically can be processed in very small chunks, and for which the processing is not time critical (terabytes of data). High performance denotes an application where the wall clock time of the complete computation is important (teraflops of processing).

Grid example: Swegrid (www.swegrid.se). Six clusters of 100 nodes each at six different sites; Gigabit Ethernet interconnect; a bunch of storage (approx. 60 TB). Software: different Linux versions and different environments, according to site policies; ARC middleware to glue things together. ARC (Advanced Resource Connector) is middleware developed by the NorduGrid project. It keeps track of site configurations (OS, hardware, installed applications) and of the current load situation (available CPUs, queue lengths, free disk space). It sends jobs to a suitable site, submits them to the local batch system, and handles stage-in and stage-out of files. ARC is built upon Globus Toolkit libraries.

Usage:
Get and install a certificate from a certificate authority: grid-cert-request
Log on to the grid: grid-proxy-init
Enter your request in a .xrsl file, e.g., myexperiment.xrsl:

&(executable=myexecutablecode)
 (stdout=myoutputfile)
 (jobname=Gridjob1)
 (cputime=10)
 (memory=32)

To submit the job to the grid: ngsub -f myexperiment.xrsl -c swelanka.ucsc.cmb.ac.lk
To check the status of the job: ngstat Gridjob1
To download the result files: ngget Gridjob1
To cancel the job: ngkill Gridjob1

Note: The difference between grid and cloud computing (source: the web). Grids are used as a computing/storage platform. We start talking about cloud computing when it offers services; a cloud is sort of a higher-level grid. As far as application domains go, grids require users (developers mostly) to actually create services from the low-level functions that the grid offers. A cloud will offer complete blocks of functionality that you can use in your application. Example (you want to create a physical simulation of a ball dropping from a certain height):
Grid: Study how to compute the physics on a computer, create appropriate code, optimize it for certain hardware, think about parallelization, set the inputs, send the application to the grid and wait for the answer.
Cloud: Set the diameter of the ball, the material from pre-set types, the height from which the ball is dropping, etc., and ask for the results.

Quiz series for Assignment 2:
Quiz 6. A list of the world's fastest 500 supercomputers (updated twice a year) is maintained at the site www.top500.org. Give a presentation on the current leader of this list.

Lecture 5; 4/3/2013: Parallel Program Design (Part 1)


Objectives: Describe data and functional parallelism.

15. Data and Functional Parallelism

In parallel processing we simply partition the problem into parts. Partitioning can be applied to the program data (data partitioning, or domain decomposition, or data parallelism). Here, the data is divided and we concurrently operate on the divided data. Partitioning can also be applied to the functions of a program (functional decomposition or functional parallelism). Here, the program is divided into independent functions and the functions are executed concurrently.

Example of data parallelism: Suppose a sequence of numbers x1,...,xn needs to be added. We can consider dividing the sequence into m parts of n/m numbers each, at which point m processes can each add one part independently to create partial sums. The m partial sums then need to be added together to form the final sum.


Figure 9: An example of functional parallelism

Example of functional parallelism: In a quality checking application, interfacing to a human, controlling a conveyor belt, capturing images of items, processing the images, detecting defects and transferring the data to a storage area network represent functional parallelism. (Note that in this quality checking application, the processing of the images is a candidate for data parallelism.)


Lecture 6; 11/3/2013: MPI (Message Passing Interface)


Objectives: Write and run MPI codes.

16. MPI Basics

MPI is a set of library routines for message passing (there are different implementations, e.g., MPICH, LAM/MPI).

Sample MPI code: Let the following be the prog.c file.

main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);
  ...
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* find process rank */
  if (myrank == 0)
    master();
  else
    slave();
  ...
  MPI_Finalize();
}

MPICH compilation and running:
To compile: mpicc -o prog prog.c
To run: mpirun -np 4 -machinefile machines prog

Each process executes the same code. The mapping of processes to processors is done through a machines file, e.g.,

10.16.66.74
10.16.66.75

The code must be initialized with MPI_Init() and finished with MPI_Finalize().

MPI Communicator: Initially all processes are enrolled in a universe called MPI_COMM_WORLD, and each process is given a rank, a number from 0 to n-1 where n = number of processes. Communicators define the scope of a communication operation.

All MPI routines are named MPI_X (X begins with an uppercase letter). MPI defines about 120 functions, but many MPI programs can be written with just six:

MPI_Init: Initialize the MPI environment.
MPI_Comm_size: Identify all available processes. MPI_Comm_size(MPI_Comm comm, int *size): Determines the size of the group associated with a communicator; comm: communicator; *size: size of group (returned).
MPI_Comm_rank: Identify my process number. MPI_Comm_rank(MPI_Comm comm, int *rank): Determines the rank of the process in the communicator; comm: communicator; *rank: rank (returned).
MPI_Finalize: Terminate the MPI execution environment.
MPI_Send: Send a message to a single process.
MPI_Recv: Receive a message from another process.

Another very simple MPI program:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  printf("Hello world! I am %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}

17. Point-to-point Message Passing

MPI defines various datatypes for MPI_Datatype, mostly with corresponding C datatypes, including:
MPI_CHAR: signed char
MPI_INT: signed int
MPI_FLOAT: float

MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm): Sends a message (blocking). *buf: send buffer; count: number of entries in buffer; datatype: data type of entries; dest: destination process rank; tag: message tag; comm: communicator.


MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status): Receives a message (blocking). *buf: receive buffer (loaded); count: max number of entries in buffer; datatype: data type of entries; source: source process rank (MPI_ANY_SOURCE matches anything); tag: message tag (MPI_ANY_TAG matches anything); comm: communicator; *status: status (returned).

MPI_Status is a structure with at least three members:
status.MPI_SOURCE (rank of the source of the message)
status.MPI_TAG (tag of the source message)
status.MPI_ERROR (potential errors)

Example of MPI_Send and MPI_Recv: e.g., to send an integer x from process 0 to process 1,

MPI_Comm_rank(MPI_COMM_WORLD, &myrank);  /* find process rank */
if (myrank == 0) {
  int x = 2;
  MPI_Send(&x,1,MPI_INT,1,3, MPI_COMM_WORLD);
} else if (myrank == 1) {
  int x;
  MPI_Recv(&x,1,MPI_INT,0,3,MPI_COMM_WORLD,&status);
}

18. Group Routines

MPI_Barrier(MPI_Comm comm): Blocks until all processes have called it. comm: communicator.

MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int root, MPI_Comm comm): Broadcasts a message from the root process to all processes in comm, including itself. *buf: message buffer (loaded); count: number of entries in buffer; datatype: data type of buffer; root: rank of root.

MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm): Gathers values from a group of processes. *sendbuf: send buffer; sendcount: number of send buffer elements; sendtype: data type of send elements; *recvbuf: receive buffer (loaded); recvcount: number of elements received from each process; recvtype: datatype of receive elements; root: rank of receiving process; comm: communicator.

MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm): Scatters a buffer from the root in parts to a group of processes. *sendbuf: send buffer; sendcount: number of elements sent to each process; sendtype: data type of elements; *recvbuf: receive buffer (loaded); recvcount: number of receive buffer elements; recvtype: datatype of receive elements; root: root process rank; comm: communicator.

MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm): Combines values from all processes into a single value. *sendbuf: send buffer address; *recvbuf: receive buffer address; count: number of send buffer elements; datatype: data type of send elements; op: reduce operation, e.g., MPI_MAX: maximum, MPI_MIN: minimum, MPI_SUM: sum, MPI_PROD: product; root: root process rank for the result; comm: communicator.

19. A Sample MPI Program

/* Sample program to add a group of numbers stored in a file */
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>

#define MAXSIZE 1000

int main(int argc, char *argv[])
{
  int myid, numprocs;
  int data[MAXSIZE], i, x, low, high, myresult = 0, result;
  char fn[255];
  FILE *fp;

  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myid);

  if (myid == 0) {  /* open input file and initialize data */
#ifdef FILEINPUT
    strcpy(fn,getenv("HOME"));
    strcat(fn, "/MPI/rand_data.txt");
    if ((fp = fopen(fn,"r")) == NULL) {
      printf("Can't open the input file: %s\n\n", fn);
      exit(1);
    }
    for (i = 0; i < MAXSIZE; i++)
      fscanf(fp, "%d", &data[i]);
#else
    for (i = 0; i < MAXSIZE; i++)
      data[i] = 10;
#endif
  }

  /* broadcast data */
  MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

  /* Add my portion of data */
  x = MAXSIZE/numprocs;
  low = myid * x;
  high = low + x;
  for (i = low; i < high; i++)
    myresult += data[i];

  /* Compute global sum */
  MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
  if (myid == 0)
    printf("The sum is %d.\n", result);

  MPI_Finalize();
  return 0;
}

20. LAM/MPI: Another MPI implementation (www.lam-mpi.org)

Booting: The user creates a text file (boot schema) containing the participating machines, in the format:
<machine> [cpu=<cpucount>] [user=<userid>]
e.g., let this file be named bhost and be as follows:

swelanka1.ucsc.cmb.ac.lk cpu=2
swelanka2.ucsc.cmb.ac.lk

To verify whether a cluster is bootable: recon -v bhost
To start LAM on the specified cluster: lamboot -v bhost
Note: lamboot starts a process on each of the specified machines; each machine allocates a dynamic port and communicates it back to lamboot, which collects them. Then lamboot gives each machine the list of machines/ports in order to form a fully connected topology.

To check the network:
ping swelanka1.ucsc.cmb.ac.lk
tping [-hv] [-c <count>] [-d <delay>] [-l <length>] <nodes>
e.g., tping -v n7 -l 1000 -c 10: Echo 1000-byte messages to node 7. Stay silent while working. Stop after 10 roundtrips and report statistics.
tping n1-2 -l 1000: Echo 1000-byte messages to nodes n1 and n2.

Compiling MPI: mpicc -o hello hello.c (or mpif77 -o hello hello.f)
Running MPI: mpirun -np 2 hello

mpirun n0-3 hello (runs one copy of the executable hello on nodes 0 through 3)
mpirun c0-4 hello (runs one copy of the executable hello on CPUs c0 through c4)

Monitoring MPI applications:
mpitask: monitor MPI processes under LAM
mpimsg: monitor MPI message buffers under LAM

To clean the entire LAM system: lamclean -v
To terminate LAM: lamhalt
In the case of node failures, lamhalt can hang. In those cases use: wipe -v bhost (note: bhost is the boot schema that we used to boot LAM).

21. Another code to illustrate MPI_Send and MPI_Recv:

/*
 * Open Systems Lab
 * http://www.lam-mpi.org/tutorials/
 * Indiana University
 *
 * MPI Tutorial
 * The canonical ring program
 *
 * Mail questions regarding tutorial material to mpi@lam-mpi.org
 */
#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  MPI_Status status;
  int num, rank, size, tag, next, from;

  /* Start up MPI */
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* Arbitrarily choose 201 to be our tag. Calculate the */
  /* rank of the next process in the ring. Use the modulus */
  /* operator so that the last process "wraps around" to rank */
  /* zero. */

  tag = 201;
  next = (rank + 1) % size;
  from = (rank + size - 1) % size;

  /* If we are the "console" process, get an integer from the */
  /* user to specify how many times we want to go around the */
  /* ring */

  if (rank == 0) {
    printf("Enter the number of times around the ring: ");
    scanf("%d", &num);
    --num;
    printf("Process %d sending %d to %d\n", rank, num, next);
    MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
  }

  /* Pass the message around the ring. The exit mechanism works */
  /* as follows: the message (a positive integer) is passed */
  /* around the ring. Each time it passes rank 0, it is decremented. */
  /* When each process receives the 0 message, it passes it on */
  /* to the next process and then quits. By passing the 0 first, */
  /* every process gets the 0 message and can quit normally. */

  while (1) {
    MPI_Recv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);
    printf("Process %d received %d\n", rank, num);
    if (rank == 0) {
      num--;
      printf("Process 0 decremented num\n");
    }
    printf("Process %d sending %d to %d\n", rank, num, next);
    MPI_Send(&num, 1, MPI_INT, next, tag, MPI_COMM_WORLD);
    if (num == 0) {
      printf("Process %d exiting\n", rank);
      break;
    }
  }

  /* The last process does one extra send to process 0, which needs */
  /* to be received before the program can exit */


  if (rank == 0)
    MPI_Recv(&num, 1, MPI_INT, from, tag, MPI_COMM_WORLD, &status);

  /* Quit */
  MPI_Finalize();
  return 0;
}

22. Some more MPI routines:

double MPI_Wtime(void): Returns the elapsed time from some point in the past, in seconds. e.g.,

double startwtime = 0.0, endwtime;

startwtime = MPI_Wtime();
/* do some work */
endwtime = MPI_Wtime();
printf("wall clock time = %f\n", endwtime-startwtime);

MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request): Starts a non-blocking send. *buf: send buffer; count: number of buffer elements; datatype: data type of elements; dest: destination rank; tag: message tag; comm: communicator; *request: request handle (returned).

MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request): Starts a non-blocking receive. *buf: receive buffer address; count: number of buffer elements; datatype: data type of elements; source: source rank; tag: message tag; comm: communicator; *request: request handle (returned).

MPI_Wait(MPI_Request *request, MPI_Status *status): Waits for a send or receive to complete and then returns. *request: request handle; *status: status (same as the return status of MPI_Recv()).

MPI_Test(MPI_Request *request, int *flag, MPI_Status *status): Tests for the completion of a non-blocking operation. *request: request handle; *flag: true if the operation has completed (returned); *status: status (returned).

e.g. 1,

MPI_Request request;
MPI_Status status;
int flag, address = 0;
...
MPI_Isend(&address, 1, MPI_INT, 2, 400,
          MPI_COMM_WORLD, &request);
...
MPI_Test(&request, &flag, &status);
if (flag) {
  ...
} else {
  ...
}

e.g. 2,

MPI_Isend(&address, 1, MPI_INT, 2, 400,
          MPI_COMM_WORLD, &request);
...
MPI_Wait(&request, &status);

MPI_Probe(int source, int tag, MPI_Comm comm, MPI_Status *status): Blocking test for a message. source: source process rank; tag: message tag; comm: communicator; *status: status (returned).

MPI_Iprobe(int source, int tag, MPI_Comm comm, int *flag, MPI_Status *status): Non-blocking test for a message. source: source process rank; tag: message tag; comm: communicator; *flag: true if there is a message (returned); *status: status (returned).

23. Basic Techniques for Optimizing Message Passing Parallel Programs:

(a) Balance the work load among the processors (load balancing).

(b) The amount of data in the messages can be increased to lessen the effect of startup times.
Note: Parallel execution time = computation time + communication time, and
communication time = t_startup + n * t_data
where n is the number of data words; t_startup is the time to send a message with no data (it includes the time to pack the message at the source and unpack it at the destination, is assumed to be a constant, and is also called the message latency); and t_data is the transmission time to send one data word, also assumed to be a constant. Startup and transmission times depend on the computer system. The startup time is typically greater than the transmission time and also much greater than the arithmetic operation time. In practice it is the startup time that dominates the communication time in many cases (e.g., the computer may execute 200 floating point operations in the time taken for message startup).

(c) Try to reduce communication: e.g., it may be better to recompute values locally than to send computed values in additional messages from one process to other processes needing these values.

(d) Overlap communication with computation: e.g., by using non-blocking communication routines (a sketch follows the note below), or by

using multiple processes on a processor (one can switch to another process when the first process is stalled due to incomplete communication or for some other reason).

Note: When an m-process algorithm is implemented on an n-processor machine (with m > n), the algorithm is said to have a parallel slackness of m/n for that machine.
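The following is a minimal sketch of overlapping communication with computation using the non-blocking routines described above. The message size, tag and ring-style neighbour ranks are chosen purely for illustration.

#include <stdio.h>
#include <mpi.h>

#define M 1000

int main(int argc, char *argv[])
{
    int rank, size;
    double sendbuf[M], recvbuf[M], local[M];
    MPI_Request sreq, rreq;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (int i = 0; i < M; i++) {
        sendbuf[i] = rank;
        local[i] = i;
    }

    int next = (rank + 1) % size;
    int prev = (rank + size - 1) % size;

    /* Start the exchange with the neighbours, but do not wait for it yet. */
    MPI_Isend(sendbuf, M, MPI_DOUBLE, next, 100, MPI_COMM_WORLD, &sreq);
    MPI_Irecv(recvbuf, M, MPI_DOUBLE, prev, 100, MPI_COMM_WORLD, &rreq);

    /* Useful work that does not depend on recvbuf overlaps the transfer. */
    double sum = 0.0;
    for (int i = 0; i < M; i++)
        sum += local[i];

    /* Only now wait for the communication to complete before using recvbuf. */
    MPI_Wait(&rreq, &status);
    MPI_Wait(&sreq, &status);

    printf("Process %d: local sum %.1f, first received value %.1f\n",
           rank, sum, recvbuf[0]);
    MPI_Finalize();
    return 0;
}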

Lecture 8; 22/4/2013: Shared Memory Programming (Part 1)


Note: Lecture 7 was done in the lab on 1/4/2013. Objectives: Write and run pthread codes.

24. Traditionally, upon fork() all resources owned by the parent are duplicated and the copy is given to the child. Linux implements fork() via the clone() system call. This call takes a series of flags that specify which resources, if any, the parent and child process should share. In Linux a thread is defined as a process that shares certain resources with other processes. Threads are created like normal tasks, with the exception that the clone() system call is passed flags corresponding to the specific resources to be shared, e.g.,

clone(CLONE_VM, 0);

Threads are suitable for writing parallel software for shared-memory computers.

25. Pthreads: A library for thread programming. Code must be compiled with -lpthread, e.g., cc -o myprogram myprogram.c -lpthread

Sample program:

#include <stdio.h>
#include <pthread.h>

#define ARRAYSIZE 1000
#define THREADS 10

void *slave(void *myid);

/* shared data */
int data[ARRAYSIZE];     /* Array of numbers to sum */
int sum = 0;
pthread_mutex_t mutex;   /* mutually exclusive lock variable */
int wsize;               /* size of work for each thread */
/* end of shared data */

void *slave(void *myid)
{
  int i,low,high,myresult=0;

  low = (int) myid * wsize;
  high = low + wsize;

  for (i=low;i<high;i++)
    myresult += data[i];

  /*printf("I am thread:%d low=%d high=%d myresult=%d \n",
           (int)myid, low,high,myresult);*/

  pthread_mutex_lock(&mutex);
  sum += myresult;   /* add partial sum to global sum */
  pthread_mutex_unlock(&mutex);
  return NULL;
}

int main()
{
  int i;
  pthread_t tid[THREADS];

  pthread_mutex_init(&mutex,NULL);   /* initialize mutex */
  wsize = ARRAYSIZE/THREADS;         /* wsize must be an integer */
  for (i=0;i<ARRAYSIZE;i++)          /* initialize data[] */
    data[i] = i+1;

  for (i=0;i<THREADS;i++)            /* create threads */
    if (pthread_create(&tid[i],NULL,slave,(void *)i) != 0)
      perror("Pthread_create fails");

  for (i=0;i<THREADS;i++)            /* join threads */
    if (pthread_join(tid[i],NULL) != 0)
      perror("Pthread_join fails");

  printf("The sum from 1 to %i is %d\n",ARRAYSIZE,sum);
}

Notes:
(a) pthread_create(tid, attr, func, arg): Creates a new thread of control that executes concurrently with the calling thread. The new thread applies the function func, passing it arg. tid: thread id; attr: an attribute structure that governs how the thread behaves (using NULL gives the default attributes).
(b) pthread_join(tid, status): Suspends the execution of the calling thread until the thread identified by tid terminates. status is a reference to a location where the completion status of the waited-upon thread will be stored.
(c) A mutex is a mutual exclusion device and is useful for protecting shared data structures from concurrent modifications, and for implementing critical sections and monitors. A mutex has two possible states: unlocked (not owned by any thread) and locked (owned by one thread). A mutex can never be owned by two different threads simultaneously. A thread attempting to lock a mutex that is already locked by another thread is suspended until the owning thread unlocks the mutex first.
(d) pthread_mutex_lock(pthread_mutex_t *mutex): Locks the given mutex.
(e) pthread_mutex_unlock(pthread_mutex_t *mutex): Unlocks the given mutex.

A pthread-based matrix multiplication code:

/* mm_threads.c    Matrix Multiplication using Threads */

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <sys/time.h>
#include <assert.h>
#include <pthread.h>

#define RANDLIMIT 5    /* Magnitude limit of generated random numbers */
#define N 8            /* Matrix size; configurable; should be a multiple of THREADS */
#define THREADS 2      /* Number of threads */
#define NUMLIMIT 70.0

void *slave(void *myid);

/* Shared data */
float a[N][N];
float b[N][N];
float c[N][N];

void *slave(void *myid)
{
  int x, low, high;

  /* Calculate which rows are handled by this thread */
  if (N >= THREADS) {   /* the matrix has at least as many rows as there are threads */
    x = N/THREADS;
    low = (int) myid * x;
    high = low + x;
  }
  else {                /* more threads than rows: prevents unnecessary running of the loop */
    x = 1;
    low = (int) myid;
    if (low >= N) {     /* nothing to calculate for the extra threads */
      high = low;
    }
    else {
      high = low + 1;   /* each thread calculates only one row */
    }
  }

  int i, j, k;

  /* Calculation. Mutual exclusion is not needed because each thread is
     accessing a different part of the array, so there is no data
     integrity issue. */
  for (i=low; i<high; i++) {
    for (j=0; j<N; j++) {
      c[i][j] = 0.0;
      for (k=0; k<N; k++) {
        c[i][j] = c[i][j] + a[i][k]*b[k][j];
      }
    }
  }
  return NULL;
}

int main(int argc, char *argv[])
{
  struct timeval start, stop;
  int i,j;
  pthread_t tid[THREADS];

  /* generate mxs randomly */
  for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
      a[i][j] = 1+(int) (NUMLIMIT*rand()/(RAND_MAX+1.0));
      b[i][j] = (double) (rand() % RANDLIMIT);
    }

#ifdef PRINT
  /* print matrices A and B */
  printf("\nMatrix A:\n");
  for (i=0; i<N; i++){
    for (j=0; j<N; j++)
      printf("%.3f\t",a[i][j]);
    printf("\n");
  }
  printf("\nMatrix B:\n");
  for (i=0; i<N; i++){
    for (j=0; j<N; j++)
      printf("%.3f\t",b[i][j]);
    printf("\n");
  }
#endif

  /* Start timing */
  gettimeofday(&start, 0);

  /* Create threads */
  for (i=0; i<THREADS; i++)
    if (pthread_create(&tid[i], NULL, slave, (void *) i) != 0)
      perror("Pthread create fails");

  /* Join threads */
  for (i=0; i<THREADS; i++)
    if (pthread_join(tid[i], NULL) != 0)
      perror("Pthread join fails");

  /* End timing */
  gettimeofday(&stop, 0);

#ifdef PRINT
  /* print results */
  printf("\nAnswer = \n");
  for (i=0; i<N; i++){
    for (j=0; j<N; j++)
      printf("%.3f\t",c[i][j]);
    printf("\n");
  }
#endif

  /* Print the timing details */
  fprintf(stdout,"Time = %.6f\n\n",
          (stop.tv_sec+stop.tv_usec*1e-6)-(start.tv_sec+start.tv_usec*1e-6));
  return(0);
}

Lecture 9; 6/5/2013: Shared Memory Programming (Part 2)


On Cache Coherence and False Sharing
Objectives: Explain the problem of cache coherence and its solution techniques. Explain false sharing and remedies.

26. Cache Coherence (This note is from the book Parallel Programming by Barry Wilkinson and Michael Allen.)

All modern computer systems have cache memories. The data in a multiprocessor system may be altered by different processors. When a processor first references a main memory location, a copy of its contents is transferred to the cache memory associated with the processor. When the processor subsequently references the data, it accesses the cache for it. If another processor references the same main memory location, a copy of the data is transferred to the cache associated with that processor, thus creating more than one copy of the data. This is not a problem until a processor alters its cached copy, that is, writes a new data value. Then a cache coherence (CC) protocol must ensure that processors subsequently obtain the newly altered data when they reference it. Cache coherence protocols use either an update policy or (more commonly) an invalidate policy. In the update policy, copies of the data in all caches are updated at the time one copy is altered. In the invalidate policy, when one copy of the data is altered, the same data in any other cache is invalidated (by clearing a valid bit in the cache); these copies are only updated when the associated processor makes a reference to the data. The programmer can assume that an effective cache coherence protocol is present in the system.

27. False Sharing: If one processor writes to ONE PART of a cache line, copies of the same cache line in other caches must be updated or invalidated even though the actual data is not shared. This is known as false sharing and can have a bad effect on performance. e.g., assume a cache line consists of four words, 0 to 3. Two processors access the same block but different words in the block (processor P0 accesses Word 3 and processor P1 accesses Word 2).

Suppose P0 alters Word 3. The cache coherence protocol will update or invalidate the block in the cache of P1 even though P1 never references Word 3. Suppose now P1 alters Word 2. Now the CC protocol will update or invalidate the block in the cache of P0 even though P0 never references Word 2. The result is an unfortunate ping-ponging of cache blocks. A solution for this false-sharing problem is for the compiler to alter the layout of the data stored in main memory, separating data altered only by one processor into different cache blocks. However, this may be difficult for the compiler.

An example of false sharing: Consider the following program:

    for (i = 0; i < 4; i++) {
        for (j = 0; j < 100000; j++)
            sum = sum + array[i][j];
    }

We can use 4 threads, with each thread operating on a vector of 100000 elements.

One implementation option: make the variable sum shared between the threads and protect access to it. Not good!

Another implementation option:

    #include ...
    double array[4][100000], sum, s[4];

    main() {
        int i, j;
        ...
        for (i = 0; i < 4; i++)
            pthread_create(..., ..., thr_sub, (void *) i);
        for (i = 0; i < 4; i++)
            pthread_join(...);
        sum = 0.0;
        for (j = 0; j < 4; j++)
            sum = sum + s[j];
    }

    void *thr_sub(void *mynum) {
        int j;
        ...
        s[mynum] = 0.0;
        for (j = 0; j < 100000; j++)
            s[mynum] = s[mynum] + array[mynum][j];
    }

Assume the cache line size is 32 bytes. Then this program (when run on a shared-memory multiprocessor) may take longer than the sequential version. Reason? The array s is stored as four consecutive 8-byte values. This 32-byte sequence fits in a single cache line (since the cache has 32-byte cache lines). In a parallel run, different threads are trying to update the same cache line. Because cache coherence is maintained at the cache-line level, this leads to the cache line containing the array s being repeatedly invalidated, causing a substantial increase in traffic on the memory bus.

A remedy: make sure that the problematic data structures fall on different cache lines.

    double array[4][100000], sum, s[4][4]; /* assuming a 32-byte cache line */

    main() {
        ...
        for (j = 0; j < 4; j++)
            sum = sum + s[j][1];
    }

    void *thr_sub(void *mynum) {
        ...
        for (j = 0; j < 100000; j++)
            s[mynum][1] = s[mynum][1] + array[mynum][j];
    }
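Another common remedy (a sketch, not part of the original notes; it reuses the names from the listing above) is to accumulate into a local variable and write to the shared array s only once per thread, so the shared cache line stops ping-ponging during the loop:

    void *thr_sub(void *mynum) {
        int me = (int)(long) mynum;    /* thread index, as in the sketch above */
        double local = 0.0;            /* lives in the thread's own stack/registers */
        int j;
        for (j = 0; j < 100000; j++)
            local = local + array[me][j];
        s[me] = local;                 /* single write to the shared array */
        return NULL;
    }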


OpenMP (Open Multiprocessing)


Objectives: Write OpenMP codes.

Acknowledgement: This note on OpenMP is substantially based on the OpenMP tutorial found at https://computing.llnl.gov/tutorials/openMP.

28. OpenMP is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming. It consists of a set of compiler directives, library routines, and environment variables that influence run-time behavior. www.openmp.org
Languages supported: Fortran, C/C++. Compile, e.g., with: cc myfile.c -fopenmp

29. pragma omp parallel: A parallel region is a block of code executed by all threads simultaneously. This is the fundamental OpenMP parallel construct. When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is a member of that team and has thread number 0 within that team. Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.

Example 1:

    #include <stdio.h>

    int main(int argc, char* argv[]) {
        #pragma omp parallel
        printf("Hello, world.\n");
        return 0;
    }

There is an implied barrier at the end of a parallel section. Only the master thread continues execution past this point. If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is undefined.

Format:

    #pragma omp parallel [clause ...] newline
        if (scalar_expression)
        private (list)
        shared (list)
        default (shared | none)
        firstprivate (list)

        reduction (operator: list)
        copyin (list)
        num_threads (integer-expression)

        structured_block

The number of threads in a parallel region is determined by the following factors, in order of precedence:
(a) Evaluation of the IF clause
(b) Setting of the NUM_THREADS clause
(c) Use of the omp_set_num_threads() library function
(d) Setting of the OMP_NUM_THREADS environment variable, e.g., setenv OMP_NUM_THREADS 2
(e) Implementation default - usually the number of CPUs on a node, though it could be dynamic
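As a small illustration of the precedence rules above (a sketch based on the standard API, not taken from the original notes):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_set_num_threads(4);             /* request 4 threads for later parallel regions */

        #pragma omp parallel num_threads(2) /* the clause overrides the call: 2 threads here */
        printf("team size = %d\n", omp_get_num_threads());

        #pragma omp parallel                /* no clause: the earlier call applies, 4 threads */
        printf("team size = %d\n", omp_get_num_threads());

        return 0;
    }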

Threads are numbered from 0 (master thread) to N-1.

Restrictions:
A parallel region must be a structured block that does not span multiple routines or code files.
It is illegal to branch into or out of a parallel region.
Only a single IF clause is permitted.
Only a single NUM_THREADS clause is permitted.

Example 2:

    #include <omp.h>
    #include <stdio.h>

    main () {
        int nthreads, tid;

        /* Fork a team of threads giving them their own copies of variables */
        #pragma omp parallel private(tid)
        {
            /* Obtain and print thread id */
            tid = omp_get_thread_num();
            printf("Hello World from thread = %d\n", tid);

            /* Only master thread does this */
            if (tid == 0) {
                nthreads = omp_get_num_threads();
                printf("Number of threads = %d\n", nthreads);
            }
        }   /* All threads join master thread and terminate */
    }
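To try these examples, one can compile with the OpenMP flag and set the thread count through the environment, for instance (the file name hello.c is an assumption):

    gcc -fopenmp hello.c -o hello
    setenv OMP_NUM_THREADS 4
    ./hello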

30. Work-sharing Constructs: A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it. OpenMP work-sharing constructs:

    Fortran:                       C/C++:

    !$OMP DO                       #pragma omp for
       ...                         { ... }
    !$OMP END DO

    !$OMP SECTIONS                 #pragma omp sections
       ...                         { ... }
    !$OMP END SECTIONS

    !$OMP SINGLE                   #pragma omp single
       ...                         { ... }
    !$OMP END SINGLE

Table 1: Worksharing constructs

These must be enclosed in a parallel region. Work-sharing constructs do not launch new threads. There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct.

Types of work-sharing constructs:
FOR - shares iterations of a loop across the team. Represents a type of data parallelism.
SECTIONS - breaks work into separate, discrete sections. Each section is executed by a thread. Can be used to implement a type of functional parallelism.
SINGLE - serializes a section of code.

Restrictions:
A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel.
Work-sharing constructs must be encountered by all members of a team or none at all.
Successive work-sharing constructs must be encountered in the same order by all members of a team.


FOR Directive: The FOR directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team. This assumes a parallel region has already been initiated; otherwise it executes in serial on a single processor.

Format:

    #pragma omp for [clause ...] newline
        schedule (type [,chunk])
        ordered
        private (list)
        firstprivate (list)
        lastprivate (list)
        shared (list)
        reduction (operator: list)
        nowait

        for_loop

SCHEDULE: Describes how iterations of the loop are divided among the threads in the team. The default schedule is implementation dependent.

STATIC: Loop iterations are divided into pieces of size chunk and then statically assigned to threads. If chunk is not specified, the iterations are evenly (if possible) divided contiguously among the threads.

DYNAMIC: Loop iterations are divided into pieces of size chunk and dynamically scheduled among the threads; when a thread finishes one chunk, it is dynamically assigned another. The default chunk size is 1.

GUIDED: For a chunk size of 1, the size of each chunk is proportional to the number of unassigned iterations divided by the number of threads, decreasing to 1. For a chunk size with value k (greater than 1), the size of each chunk is determined in the same way, with the restriction that the chunks do not contain fewer than k iterations (except for the last chunk to be assigned, which may have fewer than k iterations). The default chunk size is 1.

RUNTIME: The scheduling decision is deferred until runtime by the environment variable OMP_SCHEDULE. It is illegal to specify a chunk size for this clause. e.g.,
    setenv OMP_SCHEDULE "guided, 4"
    setenv OMP_SCHEDULE "dynamic"

ORDERED: Specifies that the iterations of the loop must be executed as they would be in a serial program.

NOWAIT / nowait: If specified, then threads do not synchronize at the end of the parallel loop.
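As a small illustration of the schedule clause (a sketch, not from the original notes; the function and variable names are assumptions), the scheduling choice can be deferred to the environment with schedule(runtime):

    #include <omp.h>

    void scale(int n, float *v, float f) {
        int i;
        /* scheduling policy and chunk size are read from OMP_SCHEDULE at run time,
           e.g., setenv OMP_SCHEDULE "dynamic, 8" before running */
        #pragma omp parallel for schedule(runtime)
        for (i = 0; i < n; i++)
            v[i] = v[i] * f;
    }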

Restrictions: The DO loop cannot be a DO WHILE loop, or a loop without loop control. Also, the loop iteration variable must be an integer and the loop control parameters must be the same for all threads. Program correctness must not depend upon which thread executes a particular iteration. It is illegal to branch out of a loop associated with a DO/for directive. The chunk size must be specified as a loop-invariant integer expression, as there is no synchronization during its evaluation by different threads. ORDERED and SCHEDULE clauses may appear once each.

Example 1: Simple vector-add program:

    #include <omp.h>
    #define CHUNKSIZE 100
    #define N 1000

    main () {
        int i, chunk;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i = 0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel shared(a,b,c,chunk) private(i)
        {
            #pragma omp for schedule(dynamic,chunk) nowait
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];
        }   /* end of parallel section */

    }

Notes:
Arrays A, B, C, and variable N will be shared by all threads.
Variable I will be private to each thread; each thread will have its own unique copy.
The iterations of the loop will be distributed dynamically in CHUNK-sized pieces.
Threads will not synchronize upon completing their individual pieces of work (NOWAIT).

SECTIONS Directive: The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be divided among the threads in the team. Independent SECTION directives are nested within a SECTIONS directive. Each SECTION is executed once by a thread in the team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section if it is quick enough and the implementation permits it.

Format:

    #pragma omp sections [clause ...] newline
        private (list)
        firstprivate (list)
        lastprivate (list)
        reduction (operator: list)
        nowait
    {
        #pragma omp section newline
            structured_block
        #pragma omp section newline
            structured_block
    }

There is an implied barrier at the end of a SECTIONS directive, unless the NOWAIT/nowait clause is used.

Restrictions:
It is illegal to branch into or out of section blocks.
SECTION directives must occur within the lexical extent of an enclosing SECTIONS directive.


Example:

    #include <omp.h>
    #define N 1000

    main () {
        int i;
        float a[N], b[N], c[N], d[N];

        /* Some initializations */
        for (i = 0; i < N; i++) {
            a[i] = i * 1.5;
            b[i] = i + 22.35;
        }

        #pragma omp parallel shared(a,b,c,d) private(i)
        {
            #pragma omp sections nowait
            {
                #pragma omp section
                for (i = 0; i < N; i++)
                    c[i] = a[i] + b[i];

                #pragma omp section
                for (i = 0; i < N; i++)
                    d[i] = a[i] * b[i];

            }   /* end of sections */
        }   /* end of parallel section */
    }

What happens if the number of threads and the number of SECTIONs are different? More threads than SECTIONs? Fewer threads than SECTIONs?
Answer: If there are more threads than sections, some threads will not execute a section and some will. If there are more sections than threads, the implementation defines how the extra sections are executed.

Which thread executes which SECTION?
Answer: It is up to the implementation to decide which threads will execute a section and which threads will not, and it can vary from execution to execution.


SINGLE Directive: The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team. It may be useful when dealing with sections of code that are not thread safe (such as I/O). Threads in the team that do not execute the SINGLE directive wait at the end of the enclosed code block, unless a NOWAIT/nowait clause is specified.

Restriction: It is illegal to branch into or out of a SINGLE block.

Format:

    #pragma omp single [clause ...] newline
        private (list)
        firstprivate (list)
        nowait

        structured_block

31. Combined Parallel Work-Sharing Constructs: OpenMP provides combined directives that are merely conveniences: PARALLEL DO / parallel for and PARALLEL SECTIONS. For the most part, these directives behave identically to an individual PARALLEL directive being immediately followed by a separate work-sharing directive. Most of the rules, clauses and restrictions that apply to both directives are in effect.

Example:

    #include <omp.h>
    #define N 1000
    #define CHUNKSIZE 100

    main () {
        int i, chunk;
        float a[N], b[N], c[N];

        /* Some initializations */
        for (i = 0; i < N; i++)
            a[i] = b[i] = i * 1.0;
        chunk = CHUNKSIZE;

        #pragma omp parallel for \
            shared(a,b,c,chunk) private(i) \
            schedule(static,chunk)
        for (i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }
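Similarly, a combined parallel sections sketch might look like the following (an illustration only; the function name and arrays are assumptions, not from the original notes):

    #include <omp.h>
    #define N 1000

    void work(float a[N], float b[N], float c[N], float d[N]) {
        int i, j;
        /* one combined directive forms the parallel region and its sections */
        #pragma omp parallel sections shared(a,b,c,d) private(i,j)
        {
            #pragma omp section
            for (i = 0; i < N; i++)
                c[i] = a[i] + b[i];

            #pragma omp section
            for (j = 0; j < N; j++)
                d[j] = a[j] * b[j];
        }
    }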

32. Synchronization Constructs

Critical construct: The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.

Format:

    #pragma omp critical [ name ] newline
        structured_block

If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to execute it, it will block until the first thread exits that CRITICAL region. The optional name enables multiple different CRITICAL regions to exist: names act as global identifiers. Different CRITICAL regions with the same name are treated as the same region. All CRITICAL sections which are unnamed are treated as the same section.

Restriction: It is illegal to branch into or out of a CRITICAL block.

Example:

    #include <omp.h>

    main() {
        int x;
        x = 0;

        #pragma omp parallel shared(x)
        {
            #pragma omp critical
            x = x + 1;
        }   /* end of parallel section */
    }

Barrier construct: The BARRIER directive synchronizes all threads in the team. When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads then resume executing in parallel the code that follows the barrier.

Format:

    #pragma omp barrier newline

Restrictions:
All threads in a team (or none) must execute the BARRIER region.
The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.
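For a simple update like x = x + 1, the atomic directive is usually a cheaper alternative to a critical section (a sketch, not from the original notes):

    #include <omp.h>

    main() {
        int x = 0;

        #pragma omp parallel shared(x)
        {
            /* atomic protects only this single memory update */
            #pragma omp atomic
            x = x + 1;
        }
    }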

33. Some FORTRAN examples

Example 1:

    !$omp parallel if (n>10000) default (none) &
    !$omp shared(n,a,b,c,x,y,z) private(f,i,scale)
          f = 1.0
    !$omp do
          do i = 1, n
             z(i) = x(i) + y(i)
          enddo
    !$omp end do nowait
    !$omp do
          do i = 1, n
             a(i) = b(i) + c(i)
          end do
    !$omp end do nowait
    !$omp barrier
          ...
          scale = sum(a(1:n)) + sum(z(1:n)) + f
          ...
    !$omp end parallel

Example 2:

    !$omp parallel default (none) &
    !$omp shared(n,a,b,c,d) private(i)
    !$omp sections
    !$omp section
          do i = 1, n-1
             b(i) = (a(i) + a(i+1))/2
          enddo
    !$omp section
          do i = 1, n
             d(i) = 1.0/c(i)
          enddo
    !$omp end sections nowait
    !$omp end parallel

54

Example 3:

    !$omp parallel
          ...
    !$omp critical
          <will be executed serially>
    !$omp end critical
          ...
    !$omp end parallel

34. A note on the ordered clause in the FOR directive (taken from the book: Parallel Programming in OpenMP by Chandra, Dagum, Kohr, Maydan, McDonald, Menon): The ordered section directive is used to impose an order across the iterations of a parallel loop. As described earlier, the iterations of a parallel loop are assumed to be independent of each other and execute concurrently without synchronization. With the ordered section directive, however, we can identify a portion of code within each loop iteration that must be executed in the original, sequential order of the loop iterations. Instances of this portion of code from different iterations execute in the same order, one after the other, as they would have executed if the loop had not been parallelized.

Example:

    !$omp parallel do ordered
          do i = 1, n
             a(i) = ... complex calculations here ...
             ! wait until the previous iteration has
             ! finished its ordered section
    !$omp ordered
             print *, a(i)      ! <will be executed serially>
             ! signal the completion of the ordered
             ! from this iteration
    !$omp end ordered
          enddo
    !$omp end parallel

35. The threadprivate directive: This directive is used to identify a global variable in C/C++ (or a common block in FORTRAN) as being private to each thread. If a global variable (or a common block) is marked as threadprivate using this directive, then a private copy of that global variable (or entire common block) is created for each thread.

Example:

    /* OpenMP example program demonstrating threadprivate variables
       Compile with: gcc -O3 -fopenmp omp_threadprivate.c -o omp_threadprivate
       Source: http://users.abo.fi/mats/PP2012/examples/OpenMP/omp_threadprivate.c */

    #include <omp.h>
    #include <stdio.h>
    #include <stdlib.h>

    int a, b, i, x, y, z, tid;
    #pragma omp threadprivate(a, x, z)   /* a, x and z are threadprivate */

    int main (int argc, char *argv[]) {
        /* Initialize the variables */
        a = b = x = y = z = 0;

        /* Fork a team of threads */
        #pragma omp parallel private(b, tid)
        {
            tid = omp_get_thread_num();
            a = b = tid;    /* a and b get the value of the thread id */
            x = tid + 10;   /* x is 10 plus the value of the thread id */
        }

        /* This section is now executed serially */
        for (i = 0; i < 1000; i++) {
            y += i;
        }
        z = 40;             /* Initialize z outside the parallel region */

        /* Fork a new team of threads and initialize the threadprivate variable z */
        #pragma omp parallel private(tid) copyin(z)
        {
            tid = omp_get_thread_num();
            z = z + tid;
            /* The variables a and x will keep their values from the last parallel
               region but b will not. z will be initialized to 40. */
            printf("Thread %d: a = %d b = %d x = %d z = %d\n", tid, a, b, x, z);
        }
        /* All threads join master thread */

        exit(0);
    }


Note: The following note is quoted from the book: Parallel Programming in OpenMP by Chandra, Dagum, Kohr, Maydan, McDonald, Menon. When a program enters a parallel region, a team of parallel threads is created. This team consists of the original master thread and some number of additional slave threads. Each slave thread has its own copy of the threadprivate variables, while the master thread continues to access its private copy as well. Both the initial copy of the master thread, as well as the copies within each of the slave threads, are initialized in the same way as the master thread's copy of those variables would be initialized in a serial instance of the program. When the end of a parallel region is reached, the slave threads disappear, but they do not die. Rather, they park themselves on a queue waiting for the next parallel region. In addition, although the slave threads are dormant, they still retain their state, in particular their instances of the threadprivate variables. As a result, the contents of threadprivate data persist for each thread from one parallel region to another. When the next parallel region is reached and the slave threads are re-engaged, they can access their threadprivate data and find the values computed at the end of the previous parallel region. This persistence is guaranteed within OpenMP so long as the number of threads does not change. If the user modifies the requested number of parallel threads, then a new set of slave threads will be created, each with a freshly initialized set of threadprivate data. Finally, during the serial portions of the program, only the master thread executes, and it accesses its private copy of the threadprivate data. End of quote.

36. Some OpenMP Notes - Some data scope attribute clauses

default clause: The default clause allows the user to specify a default PRIVATE, SHARED, or NONE scope for all variables in the lexical extent of any parallel region.

firstprivate clause: The firstprivate clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list. Format: firstprivate (list). Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing construct.

reduction clause: The reduction clause performs a reduction on the variables that appear in its list. A private copy of each list variable is created for each thread. At the end of the reduction, the reduction operation is applied to all private copies of the shared variable, and the final result is written to the global shared variable. Format: reduction (operator: list)


An example:

    #include <omp.h>
    #include <stdio.h>

    main () {
        int i, n, chunk;
        float a[100], b[100], result;

        /* Some initializations */
        n = 100;
        chunk = 10;
        result = 0.0;
        for (i = 0; i < n; i++) {
            a[i] = i * 1.0;
            b[i] = i * 2.0;
        }

        #pragma omp parallel for \
            default(shared) private(i) \
            schedule(static,chunk) \
            reduction(+:result)
        for (i = 0; i < n; i++)
            result = result + (a[i] * b[i]);

        printf("Final result= %f\n", result);
    }

copyin clause: The copyin clause provides a means for assigning the same value to threadprivate variables for all threads in the team. When a copyin clause is supplied with a parallel directive, the named threadprivate variables (or the entire threadprivate common block in FORTRAN, if specified) within the private copy of each slave thread are initialized with the corresponding values in the master's copy. Format: copyin (list). The list contains the names of variables to copy. In Fortran, the list can contain both the names of common blocks and named variables.

lastprivate clause: The lastprivate clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the original variable object. Format: lastprivate (list). The value copied back into the original variable object is obtained from the last (sequentially) iteration or section of the enclosing construct. For example, the team member which executes the final iteration of a DO section, or the team member which does the last section of a sections context, performs the copy with its own values.
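A small sketch illustrating firstprivate and lastprivate together (an illustration only; the variable names are assumptions, not from the original notes):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int i, offset = 100, last_i = -1;

        /* offset: each thread starts with a private copy initialized to 100 (firstprivate).
           last_i: after the loop it holds the value from the sequentially last iteration (lastprivate). */
        #pragma omp parallel for firstprivate(offset) lastprivate(last_i)
        for (i = 0; i < 8; i++) {
            last_i = i + offset;
        }

        printf("last_i = %d\n", last_i);   /* prints 107 */
        return 0;
    }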

