
Introduction to OpenMP

Edward Valeev, Department of Chemistry, Virginia Tech, Blacksburg, VA

2009 HPC-Chem Summer School, Knoxville, TN

Lecture Outline
- Basics of Shared-Memory Programming
- OpenMP Summary: what it is good for, what it is not good for
- OpenMP Minutiae: parallelizing loops, parallel sections
- Example: SCF code using OpenMP
- Homework: tweak the MPI SCF code to use OpenMP


Thread vs. Process


Process (heavyweight task):
- has an ID, instruction pointer, stack, heap, file pointers, and other resources
- processes do not share address spaces (i.e. a heap pointer in 2 different processes corresponds to different physical memory addresses)

Thread (lightweight task):
- has an instruction pointer and stack
- shares the heap and other resources with the other threads in its process
- threads in a process share an address space (i.e. a heap pointer in 2 different threads corresponds to the same physical memory address)

Message passing: MPI vs. threads


// MPI: tasks are processes -- task 0 must communicate the result to the others
static int task0_result;
if (task_id == 0) task0_result = compute_result();
MPI_Bcast(&task0_result, 1, MPI_INT, 0, MPI_COMM_WORLD);  // broadcast from task 0 to all tasks

// threads: no communication needed
extern int task0_result;
if (task_id == 0) task0_result = compute_result();
// all threads see the global task0_result!


Uniprocessors vs. Shared-Memory Multiprocessors


[Diagram: a generic uniprocessor (one core with an execution unit and L1 cache, backed by an L2 cache and RAM) vs. a generic 2-core processor (Core 0 and Core 1, each with its own execution unit and L1 cache, sharing an L2 cache and RAM).]

concurrent reads/writes of shared resources (e.g. global variables) by multiple threads produce undefined results...

... and cache coherence is an issue!


Race Conditions, Critical Sections


// sum up the ids of all threads that executed this code
static int thread_id_sum;
const int thread_id = this_thread::get_id();
thread_id_sum += thread_id;

race condition!

races occur when multiple threads simultaneously mutate a shared resource; the outcome is not defined!
// sum up the ids of all threads that executed this code, safely
static int thread_id_sum;
const int thread_id = this_thread::get_id();
static mutex lock;  // the mutex must be shared by all threads
lock.lock();
thread_id_sum += thread_id;
lock.unlock();

critical section

a critical section in a code is executed by one thread at a time

Best practice:
- avoid sharing resources between threads. Reducing the scope of the data also improves the program design!
- when sharing resources, avoid writing to the same location from multiple threads
- if you must update the same resource, do so in critical sections to avoid races. Consider using atomic variables (see the C++0x standard; sketch below).
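A minimal sketch of the atomic-variable alternative mentioned above (not from the original slides; uses the C++0x/C++11 std::atomic, and the helper function is hypothetical):

#include <atomic>

static std::atomic<int> thread_id_sum(0);  // atomic accumulator: no mutex needed

void add_my_id(int thread_id) {  // hypothetical helper called by each thread
  thread_id_sum += thread_id;    // atomic read-modify-write, race-free
}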

What is OpenMP?
Synopsis
double* X = new double[N];
double* Y = new double[N];
const double a = 3.2;
initialize(X);
initialize(Y);

// perform DAXPY operation in parallel
int i;
#pragma omp parallel for
for (i=0; i<N; ++i)
  Y[i] += a * X[i];

OpenMP is preprocessor statements (pragmas) + API + runtime support.
OpenMP is a threading tool for non-experts (i.e. most of us)!!!
Best for:
- simple (perfectly-nested) loops
- regular data-access patterns
Less appropriate (but perfectly useful!) for:
- nontrivial code that is likely to need debugging
- imperfectly-nested loops
- master-slave parallelization


Hello, World!
helloworld.cc
#include <omp.h>     // must include omp.h in OpenMP-enabled code
#include <iostream>

int main(int argc, char** argv) {
#pragma omp parallel  // threads are spawned at the start of the parallel region
  {
    const int thread_id = omp_get_thread_num();
#pragma omp critical  // what happens if you remove this line?!
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      const int nthread = omp_get_num_threads();
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }  // the end of parallel region -- synchronization
  return 0;
}

Compile/Run
[ThinAir:~/test/openmp] evaleev% g++ -fopenmp -g ./helloworld.cc -o helloworld
[ThinAir:~/test/openmp] evaleev% setenv OMP_NUM_THREADS 2
[ThinAir:~/test/openmp] evaleev% ./helloworld
Hello, World, from thread 0
Hello, World, from thread 1
There are 2 threads


Portable Hello, World!


helloworld.cc
#if _OPENMP  // an OpenMP-capable compiler defines the macro _OPENMP (to a nonzero version date)
# include <omp.h>
#endif
#include <iostream>

int main(int argc, char** argv) {
#pragma omp parallel
  {
#if _OPENMP
    const int thread_id = omp_get_thread_num();
#else
    const int thread_id = 0;
#endif
#pragma omp critical
    std::cout << "Hello, World, from thread " << thread_id << std::endl;
#pragma omp barrier
    if (thread_id == 0) {
      const int nthread = omp_get_num_threads();  // TODO "protect" this line also!!!
      std::cout << "There are " << nthread << " threads" << std::endl;
    }
  }
  return 0;
}
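One way to address the TODO above (a sketch, not from the original slides): guard the omp_get_num_threads() call with the same _OPENMP macro, falling back to a single thread in the serial build:

#if _OPENMP
    const int nthread = omp_get_num_threads();
#else
    const int nthread = 1;  // serial build: exactly one thread
#endif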

Can compile with and without OpenMP


[ThinAir:~/test/openmp] evaleev% g++ -fopenmp -g ./helloworld.cc -o helloworld
[ThinAir:~/test/openmp] evaleev% g++ -g ./helloworld.cc -o helloworld


Simple for loops


loops1.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 2.0);
  double* b = new double[N]; std::fill(b, b+N, 3.0);
  double* c = new double[N];
  double* d = new double[N];
  double sum = 0.0;
  int i;

#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];  // c = a + b

#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
    d[i] = a[i] * b[i];  // d = a * b

#pragma omp parallel for private(i) reduction(+:sum)
  for(i=0; i<N; ++i)
    sum += c[i] + d[i];  // without "reduction" this would be a race condition!

  std::cout << "sum = " << sum << std::endl;
  return 0;
}
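For intuition, the reduction(+:sum) clause behaves roughly like the following hand-written pattern (a sketch, not from the original slides): each thread accumulates into a private partial sum, and the partials are combined in a critical section.

double sum = 0.0;
#pragma omp parallel
{
  double partial = 0.0;  // private per-thread accumulator
#pragma omp for
  for(int i=0; i<N; ++i)
    partial += c[i] + d[i];
#pragma omp critical  // combine the per-thread partials, one thread at a time
  sum += partial;
}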


Multiple for loops


loops2.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 1.0);
  double* b = new double[N]; std::fill(b, b+N, 2.0);
  double* c = new double[N];
  double* d = new double[N];
  double sum = 0.0;

#pragma omp parallel  // threads live throughout multiple for loops
  {
    int i;
#pragma omp for private(i) nowait  // no synchronization on exit because of nowait
    for(i=0; i<N; ++i)
      c[i] = a[i] + b[i];  // c = a + b

#pragma omp for private(i)  // implied synchronization on exit! (barrier)
    for(i=0; i<N; ++i)
      d[i] = a[i] * b[i];  // d = a * b

#pragma omp for private(i) reduction(+:sum)
    for(i=0; i<N; ++i)
      sum += c[i] + d[i];  // without "reduction" this would be a race condition!
  }  // the end of parallel region -- synchronization

  std::cout << "sum = " << sum << std::endl;
  return 0;
}

Scheduling for loops


loops3.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 1.0);
  double* b = new double[N]; std::fill(b, b+N, 2.0);
  double* c = new double[N];
  double sum = 0.0;
  int i;

  // schedule this for loop in a dynamic fashion
  // other options for the schedule clause: static, guided, runtime, auto
#pragma omp parallel for private(i) schedule(dynamic)
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];

#pragma omp parallel for private(i) reduction(+:sum)
  for(i=0; i<N; ++i)
    sum += c[i];

  std::cout << "sum = " << sum << std::endl;
  return 0;
}
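The schedule clause also accepts an optional chunk size, trading load balance against scheduling overhead. A hypothetical variant of the first loop (not in the original slides) hands out iterations 16 at a time:

#pragma omp parallel for private(i) schedule(dynamic, 16)  // chunks of 16 iterations
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];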


Collapse nested for loops (in OpenMP v 3.0)


loops4.cc
#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 1.0);
  double* b = new double[N]; std::fill(b, b+N, 2.0);
  double* c = new double[N];
  double sum = 0.0;

#pragma omp parallel  // threads live throughout multiple for loops
  {
    int i, j;
#pragma omp for private(i)
    for(i=0; i<N; ++i)  // c = a + b
      c[i] = a[i] + b[i];

#pragma omp for private(i,j) collapse(2) reduction(+:sum)
    for(j=0; j<N; ++j)
      for(i=0; i<N; ++i)
        sum += c[i];
  }  // the end of parallel region -- synchronization

  std::cout << "sum = " << sum << std::endl;
  return 0;
}


Threading SCF: overlap


outer loop only
void make_overlap(int nbf, double* overlap) {
  int i, j;
#pragma omp parallel for private(i, j)  // only the outer loop is parallelized
  for(i=0; i<nbf; ++i) {
    for(j=0; j<=i; ++j) {
      const int IJ = i*nbf + j;
      const int JI = j*nbf + i;
      overlap[IJ] = overlap[JI] = s(i,j);
    }
  }
}

both loops
void make_overlap(int nbf, double* overlap) {
  int i, j;
#pragma omp parallel private(i, j)  // parallelize imperfectly nested loops by hand
  {
    // must query the thread id and count inside the parallel region
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for(i=0; i<nbf; ++i) {
      for(j=0; j<=i; ++j) {
        const int IJ = i*nbf + j;
        if (IJ % nthread != thread_id) continue;  // round-robin work distribution
        const int JI = j*nbf + i;
        overlap[IJ] = overlap[JI] = s(i,j);
      }
    }
  }
}

Threading SCF: overlap


collapse loops in OpenMP v. 3.0
void make_overlap(int nbf, double* overlap) {
  int i, j;
#pragma omp parallel for collapse(2) private(i, j)
  for(i=0; i<nbf; ++i) {
    for(j=0; j<nbf; ++j) {  // full square loop: collapse(2) requires a rectangular iteration space
      const int IJ = i*nbf + j;
      overlap[IJ] = s(i,j);
    }
  }
}


Threading SCF: Fock matrix

see scf/openmp in the repository
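The Fock-build code itself lives in the repository; as orientation, a generic OpenMP pattern for it might look like the sketch below (the specifics are assumptions: compute_eri, the row-major matrix layout, and the loop structure stand in for the real integrals code). Each thread accumulates into a private copy of the Fock matrix, which avoids races on F; the copies are combined in a critical section.

#include <omp.h>
#include <vector>

double compute_eri(int i, int j, int k, int l);  // hypothetical two-electron integral routine

void make_fock(int nbf, const double* D, double* F) {
#pragma omp parallel
  {
    std::vector<double> F_local(nbf*nbf, 0.0);  // per-thread Fock copy: no races on F
#pragma omp for schedule(dynamic)  // integral costs vary, so balance the load dynamically
    for(int i=0; i<nbf; ++i)
      for(int j=0; j<nbf; ++j)
        for(int k=0; k<nbf; ++k)
          for(int l=0; l<nbf; ++l)
            // closed-shell Fock: F(i,j) += D(k,l) * [2 (ij|kl) - (ik|jl)]
            F_local[i*nbf+j] += D[k*nbf+l] *
                (2.0*compute_eri(i,j,k,l) - compute_eri(i,k,j,l));
#pragma omp critical  // combine the per-thread copies, one thread at a time
    for(int ij=0; ij<nbf*nbf; ++ij)
      F[ij] += F_local[ij];
  }
}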


Critical sections
loops5.cc

#include <omp.h>
#include <iostream>
#include <algorithm>

int main(int argc, char** argv) {
  const int N = 1024;
  double* a = new double[N]; std::fill(a, a+N, 2.0);
  double* b = new double[N]; std::fill(b, b+N, 3.0);
  double* c = new double[N];
  double sum = 0.0;
  int i;

#pragma omp parallel for private(i)
  for(i=0; i<N; ++i)
    c[i] = a[i] + b[i];

#pragma omp parallel for private(i)
  for(i=0; i<N; ++i) {
#pragma omp critical
    sum += c[i];  // less efficient than "reduction"!
  }

  std::cout << "sum = " << sum << std::endl;
  return 0;
}


Cache considerations
cache.cc
#include <omp.h>
#define N 65536 /* 64k */

int main() {
  char source[N], destination[N];
#pragma omp parallel shared(source, destination)
  {
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    for (int j=0; j < 10000; j++) {
      for (int k=thread_id; k<N; k+=nthread)
        destination[k] = source[k];  // interleaved writes: cache coherence penalty
    }
  }
  return 0;
}

2 threads take about the same time as 1 on my laptop
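The culprit is the interleaved index pattern above: neighboring threads write to adjacent bytes and therefore fight over the same cache lines (false sharing). A sketch of the usual fix (not in the original slides, and assuming nthread divides N evenly) gives each thread one contiguous block:

#pragma omp parallel shared(source, destination)
  {
    const int thread_id = omp_get_thread_num();
    const int nthread = omp_get_num_threads();
    const int chunk = N / nthread;   // assumes nthread divides N evenly
    const int begin = thread_id * chunk;
    for (int j = 0; j < 10000; j++)
      for (int k = begin; k < begin + chunk; k++)
        destination[k] = source[k];  // each thread stays within its own cache lines
  }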


Summary
What else is there: sections, atomic stores/loads, nested parallelism, locks.
OpenMP is easy... except when it isn't.
The current version is 2.5. OpenMP v 3.0 adds additional useful features (parallel nested loops, tasks). Look for it in GNU (4.4+) and Intel compilers.


Homework
- Compile and run the OpenMP helloworld.cc. Play with it! Enable/disable the critical section and barrier. What happens? Why?
- Use the omp_get_num_procs() function to obtain the number of cores on the machine (see the sketch below). How many does a Kraken node have? Surveyor?
- Modify the MPI SCF code to also use OpenMP, for multi-level parallelism.
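A minimal sketch for the omp_get_num_procs() item (not from the original slides):

#include <omp.h>
#include <iostream>

int main() {
  // omp_get_num_procs() reports the number of processors available to the program
  std::cout << "number of cores: " << omp_get_num_procs() << std::endl;
  return 0;
}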

