💻Parallel and Distributed Computing Unit 4 Review

4.1 OpenMP Fundamentals and Directives

Written by the Fiveable Content Team • Last updated August 2025

OpenMP Concepts and Architecture

OpenMP (Open Multi-Processing) is a standard for shared-memory parallel programming in C, C++, and Fortran. It lets you take existing sequential code and parallelize it incrementally, using compiler directives, library routines, and environment variables. The OpenMP Architecture Review Board (ARB) manages the specification, and implementations exist across platforms from laptops to supercomputers.

Thread-Based Parallelism Model

OpenMP uses a fork-join execution model. Your program starts as a single master thread. When it hits a parallel region, the master "forks" a team of worker threads. Those threads execute concurrently, then "join" back into the master thread when the region ends. Sequential execution resumes until the next parallel region.

The OpenMP runtime decides how to map threads to processors based on system load and available cores. You can influence this through environment variables (like OMP_NUM_THREADS) or programmatically within your code.

Flexibility and Scalability

A major advantage of OpenMP is incremental parallelization. You don't have to rewrite your whole program. Instead, you identify hotspots (loops, independent tasks) and add directives to just those sections. This makes it practical for large existing codebases where a full rewrite isn't feasible.

OpenMP scales across hardware, from a quad-core desktop to a many-core HPC node. Fine-grained control over thread count, scheduling, and data sharing lets you tune performance for your specific architecture.

OpenMP Directives for Parallelization

Basic Directive Syntax

OpenMP directives tell the compiler which code sections to parallelize and how. In C/C++, they take this form:

</>Code
#pragma omp directive-name [clause, ...]

In Fortran, the equivalent is:

</>Code
!$OMP directive-name [clause, ...]

Clauses modify the behavior of a directive (e.g., controlling data sharing or scheduling). A directive without clauses uses default behavior.

Core Parallelization Directives

  • parallel creates a team of threads. All code inside the parallel block runs on every thread unless further directives distribute the work.
  • for (C/C++) / do (Fortran) distributes loop iterations across threads. Each thread gets a subset of iterations.
  • sections assigns distinct code blocks to different threads, useful when you have independent tasks that aren't loop-based.
  • single restricts a block so only one thread executes it (e.g., for I/O or initialization), while other threads wait at an implicit barrier.

A common pattern combines parallel and for to parallelize a loop:

</>C
#pragma omp parallel
{
  #pragma omp for
  for(int i = 0; i < N; i++) {
    // iterations distributed across threads
  }
}

You can also write this as a single combined directive: #pragma omp parallel for.


Control Clauses and Work Distribution

Clauses give you precise control over how data is handled and how work is divided:

  • private(var) gives each thread its own uninitialized copy of var.
  • shared(var) makes var accessible to all threads (this is the default for most variables).
  • reduction(op:var) gives each thread a private copy of var, then combines all copies using operator op (like +, *, max) at the end of the region. This safely handles accumulations without manual synchronization.
  • schedule(type) controls how loop iterations map to threads. Common types:
    • static splits iterations into equal-sized chunks assigned up front, before the loop runs; it has the lowest overhead when iterations cost about the same.
    • dynamic assigns chunks to threads as they become free, better for uneven workloads.
    • guided starts with large chunks and shrinks them, balancing load with less overhead than dynamic.
</>C
#pragma omp parallel for private(x) shared(y) reduction(+:sum) schedule(dynamic)
for(int i = 0; i < N; i++) {
  x = compute(i);
  sum += x;
}

Here, each thread gets its own x, all threads share y, and partial sums are safely combined at the end.

Fork-Join Model in OpenMP

Execution Flow

The fork-join cycle follows a clear pattern:

  1. Program runs sequentially on the master thread.
  2. Master encounters a parallel directive and forks a team of threads.
  3. All threads (including the master) execute the parallel region concurrently.
  4. Threads hit the end of the parallel region and join back. An implicit barrier ensures all threads finish before continuing.
  5. Only the master thread continues sequential execution.

Thread Management

You control the number of threads in several ways:

  • The num_threads(N) clause on a specific parallel directive
  • The OMP_NUM_THREADS environment variable (applies globally)
  • The omp_set_num_threads(N) library call

Nested parallelism occurs when a parallel region appears inside another parallel region. By default, the inner region runs with just one thread. To enable true nesting, set OMP_NESTED=true or call omp_set_nested(1); note that OpenMP 5.0 deprecates these in favor of OMP_MAX_ACTIVE_LEVELS and omp_set_max_active_levels.

</>C
#pragma omp parallel num_threads(4)
{
  // 4 threads running here
  #pragma omp parallel num_threads(2)
  {
    // If nesting enabled: up to 8 total threads
    // If nesting disabled: inner region runs on 1 thread per outer thread
  }
}

Performance Implications

Every fork and join introduces overhead: thread creation (or wake-up), synchronization, and potential cache effects. To minimize this cost:

  • Make parallel regions large enough that the work outweighs the overhead.
  • Avoid forking and joining repeatedly inside tight loops. Instead, put the parallel region outside the loop.
  • Watch for load imbalance, where some threads finish early and sit idle. The schedule(dynamic) clause can help when iteration costs vary.

Shared vs. Private Variables

Getting data sharing right is one of the trickiest parts of OpenMP. Mistakes here cause race conditions, where threads read and write the same memory unpredictably, producing wrong results that may differ between runs.

Shared Variables

Most variables are shared by default in a parallel region. All threads see the same memory location, which makes shared variables useful for communication but dangerous if multiple threads write to them without protection.

</>C
int sum = 0;
#pragma omp parallel shared(sum)
{
  // All threads read/write the same 'sum' — race condition risk
}

Private Variables

A private variable gives each thread its own independent copy. The loop index in a for directive is automatically private. Other variables need the private clause.

One gotcha: private variables are uninitialized when the parallel region starts and undefined when it ends. Don't assume they carry values in or out.

</>C
#pragma omp parallel for private(temp)
for(int i = 0; i < N; i++) {
  temp = array[i] * 2;  // each thread has its own 'temp'
  result[i] = temp + 1;
}

Data Sharing Variants

Two variants solve the initialization and finalization problem:

  • firstprivate(var) initializes each thread's private copy with the value var had before the parallel region.
  • lastprivate(var) copies the value from the thread that executed the logically last iteration back to the original variable after the region ends.
</>C
int x = 5;
#pragma omp parallel for firstprivate(x) lastprivate(x)
for(int i = 0; i < N; i++) {
  x += i;  // each thread starts with x = 5
}
// After the loop, x holds the value from iteration N-1

Preventing Race Conditions

When threads must write to shared data, you need synchronization:

  • critical ensures only one thread at a time executes a block.
  • atomic protects a single memory update (lighter weight than critical).
  • barrier forces all threads to wait until every thread reaches that point.
  • reduction is often the cleanest solution for accumulations, since it avoids explicit locking entirely.
</>C
#pragma omp parallel
{
  #pragma omp critical
  {
    shared_counter++;  // only one thread at a time
  }
}

Use critical and atomic sparingly. If too much of your code is serialized by synchronization, you lose the benefit of parallelism.