Lecture Outline
- Basics of shared-memory programming
- OpenMP summary: what it is good for, what it is not good for
- OpenMP minutiae: parallelizing loops, parallel sections
- Example: SCF code using OpenMP
- Homework: tweak the MPI SCF code to use OpenMP
[Diagram: shared-memory machine, multiple cores with private L2 caches connected to shared RAM]
concurrent reads/writes of shared resources (e.g. global variables) by multiple threads produce undefined results...
race condition!
races occur when multiple threads simultaneously mutate a shared resource; the outcome is not defined!
// accumulate the ids of all threads that execute this code
static std::size_t thread_id_sum = 0;
static std::mutex lock;  // the mutex must be shared by all threads, not per-thread
const std::size_t thread_id =
    std::hash<std::thread::id>()(std::this_thread::get_id());
lock.lock();
thread_id_sum += thread_id;
lock.unlock();
critical section
a critical section is a region of code executed by at most one thread at a time

Best practice:
- avoid sharing resources between threads; reducing the scope of the data also improves the program design!
- when sharing resources, avoid writes to the same location by multiple threads
- if you must update the same resource, do so in critical sections to avoid races
- consider using atomic variables (see the C++0x standard)
2009 HPC-Chem Summer School, Knoxville, TN
What is OpenMP?
Synopsis
double* X = new double[N];
double* Y = new double[N];
const double a = 3.2;
initialize(X);
initialize(Y);
// perform DAXPY operation in parallel
int i;
#pragma omp parallel for
for (i=0; i<N; ++i)
  Y[i] += a * X[i];
- OpenMP is preprocessor statements (pragmas) + API + runtime support
- OpenMP is a threading tool for non-experts (i.e. most of us)!!!
- Best for: simple (perfectly-nested) loops, regular data-access patterns
- Less appropriate (but perfectly useful!) for: nontrivial code that is likely to need debugging, imperfectly-nested loops, master-slave parallelization
Hello, World!
helloworld.cc
#include <omp.h>      // must include omp.h in OpenMP-enabled code
#include <iostream>
int main(int argc, char** argv) {
#pragma omp parallel  // threads are spawned at the start of the parallel region
  {
    const int thread_id = omp_get_thread_num();
#pragma omp critical  // what happens if you remove this line?!
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      const int nthread = omp_get_num_threads();
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }  // the end of parallel region -- synchronization
  return 0;
}
Compile/Run
[ThinAir:~/test/openmp] evaleev% g++ -fopenmp -g ./helloworld.cc -o helloworld
[ThinAir:~/test/openmp] evaleev% setenv OMP_NUM_THREADS 2
[ThinAir:~/test/openmp] evaleev% ./helloworld
Hello, World, from thread 0
Hello, World, from thread 1
There are 2 threads
int main(int argc, char** argv) {
#pragma omp parallel
  {
#if _OPENMP
    const int thread_id = omp_get_thread_num();
#else
    const int thread_id = 0;
#endif
#pragma omp critical
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      // TODO "protect" this line also!!!
      const int nthread = omp_get_num_threads();
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }
  return 0;
}
// d = a * b
#pragma omp parallel for private(i) reduction(+:sum)
  for(i=0; i<N; ++i)
    sum += c[i] + d[i];  // without "reduction" this would be a race condition!
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
#pragma omp for private(i) reduction(+:sum)
    for(i=0; i<N; ++i)
      sum += c[i] + d[i];  // without "reduction" this would be a race condition!
  }  // the end of parallel region -- synchronization
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
both loops
void make_overlap(int nbf, double* overlap) {
#pragma omp parallel  // parallelize imperfectly nested loops
  {
    // thread id/count must be queried *inside* the parallel region
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for (int i=0; i<nbf; ++i) {
      for (int j=0; j<=i; ++j) {
        const int IJ = i*nbf + j;
        if (IJ % nthread != thread_id) continue;  // round-robin work distribution
        const int JI = j*nbf + i;
        overlap[IJ] = overlap[JI] = s(i,j);
      }
    }
  }
}
Critical sections
loops5.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 2.0);
  double* b = new double[N]; std::fill(b, b+N, 3.0);
  double* c = new double[N];
  double sum = 0.0;
  int i;
#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];
#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
#pragma omp critical
    sum += c[i];  // less efficient than "reduction"!
  std::cout << "sum = " << sum << std::endl;
  return 0;
}
Cache considerations
cache.cc
#include <omp.h>
#define N 65536 /* 64k */
int main() {
  char source[N], destination[N];
#pragma omp parallel shared(source, destination)
  {
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for (int j=0; j < 10000; j++) {
      for (int k=thread_id; k<N; k+=nthread)
        destination[k] = source[k];  // cache coherence penalty
    }
  }
  return 0;
}
Summary
- What else is there: sections, atomic stores/loads, nested parallelism, locks
- OpenMP is easy... except when it isn't
- The current version is 2.5. OpenMP 3.0 adds additional useful features (parallel nested loops, tasks). Look for it in GNU (4.4+) and Intel compilers.
Homework
- Compile and run the OpenMP helloworld.cc. Play with it! Enable/disable the critical section and barrier. What happens? Why?
- Use the omp_get_num_procs() function to obtain the number of cores on the machine. How many does a Kraken node have? Surveyor?
- Modify the MPI SCF code to also use OpenMP, for multi-level parallelism.