Lecture Outline
- Basics of shared-memory programming
- OpenMP summary: what it is good for, what it is not good for
- OpenMP minutiae: parallelizing loops, parallel sections
- Example: SCF code using OpenMP
- Homework: tweak the MPI SCF code to use OpenMP
[Diagram: shared-memory machine, multiple cores with private L2 caches connected to shared RAM]
concurrent reads/writes of shared resources (e.g. global variables) by multiple threads produce undefined results...
race condition!
races occur when multiple threads simultaneously mutate a shared resource; the outcome is not defined!
// accumulate the ids of all threads that execute this code
static std::size_t thread_id_sum = 0;
static std::mutex lock;  // the mutex must be shared by all threads, not per-thread
const std::size_t thread_id =
    std::hash<std::thread::id>()(std::this_thread::get_id());
lock.lock();
thread_id_sum += thread_id;
lock.unlock();
critical section
a critical section is a region of code executed by at most one thread at a time

Best practice:
- avoid sharing resources between threads; reducing the scope of the data also improves the program design!
- when sharing resources, avoid writes to the same location by multiple threads
- if you must update the same resource, do so in critical sections to avoid races
- consider using atomic variables (see the C++0x standard)
2009 HPC-Chem Summer School, Knoxville, TN
What is OpenMP?
Synopsis
double* X = new double[N];
double* Y = new double[N];
const double a = 3.2;
initialize(X);
initialize(Y);
// perform DAXPY operation in parallel
int i;
#pragma omp parallel for
for (i=0; i<N; ++i)
  Y[i] += a * X[i];
- OpenMP is preprocessor statements (pragmas) + API + runtime support
- OpenMP is a threading tool for non-experts (i.e. most of us)!!!
- Best for: simple (perfectly-nested) loops, regular data-access patterns
- Less appropriate (but perfectly useful!) for: nontrivial code that is likely to need debugging, imperfectly-nested loops, master-slave parallelization
Hello, World!
helloworld.cc
#include <omp.h>      // must include omp.h in OpenMP-enabled code
#include <iostream>
int main(int argc, char** argv) {
#pragma omp parallel  // threads are spawned at the start of the parallel region
  {
    const int thread_id = omp_get_thread_num();
#pragma omp critical  // what happens if you remove this line?!
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      const int nthread = omp_get_num_threads();
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }  // the end of parallel region -- synchronization
  return 0;
}
Compile/Run
[ThinAir:~/test/openmp] evaleev% g++ -fopenmp -g ./helloworld.cc -o helloworld
[ThinAir:~/test/openmp] evaleev% setenv OMP_NUM_THREADS 2
[ThinAir:~/test/openmp] evaleev% ./helloworld
Hello, World, from thread 0
Hello, World, from thread 1
There are 2 threads
int main(int argc, char** argv) {
#pragma omp parallel
  {
#if _OPENMP
    const int thread_id = omp_get_thread_num();
#else
    const int thread_id = 0;
#endif
#pragma omp critical
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      // TODO "protect" this line also!!!
      const int nthread = omp_get_num_threads();
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }
  return 0;
}
// d = a * b
#pragma omp parallel for private(i) reduction(+:sum)
  for(i=0; i<N; ++i)
    sum += c[i] + d[i];  // without "reduction" this would be a race condition!
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
#pragma omp for private(i) reduction(+:sum)
    for(i=0; i<N; ++i)
      sum += c[i] + d[i];  // without "reduction" this would be a race condition!
  }  // the end of parallel region -- synchronization
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
both loops
void make_overlap(int nbf, double* overlap) {
#pragma omp parallel  // parallelize imperfectly nested loops
  {
    // thread id/count must be queried *inside* the parallel region
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for (int i=0; i<nbf; ++i) {
      for (int j=0; j<=i; ++j) {
        const int IJ = i*nbf + j;
        if (IJ % nthread != thread_id) continue;  // round-robin work distribution
        const int JI = j*nbf + i;
        overlap[IJ] = overlap[JI] = s(i,j);
      }
    }
  }
}
Critical sections
loops5.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 2.0);
  double* b = new double[N]; std::fill(b, b+N, 3.0);
  double* c = new double[N];
  double sum = 0.0;
  int i;
#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];
#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
#pragma omp critical
    sum += c[i];  // less efficient than "reduction"!
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
Cache considerations
cache.cc
#include <omp.h>
#define N 65536 /* 64k */
int main() {
  char source[N], destination[N];
#pragma omp parallel shared(source, destination)
  {
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for (int j=0; j < 10000; j++) {
      for (int k=thread_id; k<N; k+=nthread)
        destination[k] = source[k];  // cache coherence penalty
    }
  }
  return 0;
}
Summary
- What else is there: sections, atomic stores/loads, nested parallelism, locks
- OpenMP is easy... except when it isn't
- The current version is 2.5. OpenMP 3.0 adds additional useful features (parallel nested loops, tasks). Look for it in GNU (4.4+) and Intel compilers.
Homework
- Compile and run the OpenMP helloworld.cc. Play with it! Enable/disable the critical section and barrier. What happens? Why?
- Use the omp_get_num_procs() function to obtain the number of cores on the machine. How many does a Kraken node have? Surveyor?
- Modify the MPI SCF code to also use OpenMP, for multi-level parallelism.