💾Intro to Computer Architecture Unit 7 – Parallel and Multiprocessing Systems
Parallel and multiprocessing systems are the backbone of modern computing, enabling faster and more efficient execution of complex tasks. These systems leverage multiple processors or cores to perform computations simultaneously, dramatically improving performance in various applications.
From shared memory architectures to distributed systems, parallel computing offers diverse approaches to tackle computational challenges. Understanding key concepts like speedup, Amdahl's Law, and programming models is crucial for harnessing the power of parallel processing in real-world scenarios.
Key Concepts and Terminology
Parallel processing involves executing multiple tasks or instructions simultaneously to improve performance and efficiency
Concurrency refers to the ability of a system to handle multiple tasks or processes at the same time, which may or may not be executed in parallel
Speedup measures the performance improvement gained by using parallel processing compared to sequential processing and is calculated as $\text{Speedup} = \frac{\text{Sequential execution time}}{\text{Parallel execution time}}$
Amdahl's Law states that the overall speedup of a parallel system is limited by the portion of the program that cannot be parallelized and is expressed as $\text{Speedup} = \frac{1}{(1 - P) + \frac{P}{N}}$, where P is the fraction of the program that can be parallelized and N is the number of processors
Amdahl's Law implies that there is a limit to the speedup that can be achieved through parallelization, even with an infinite number of processors
Gustafson's Law, also known as scaled speedup, considers the case where the problem size grows with the number of processors and is expressed as $\text{Scaled Speedup} = N - (1 - P)(N - 1)$, where N is the number of processors and P is the fraction of the program that can be parallelized (a small numeric comparison of both laws follows this list)
Granularity refers to the size of the tasks or units of work that can be executed in parallel
Fine-grained parallelism involves breaking down a problem into small, independent tasks that can be executed concurrently
Coarse-grained parallelism involves dividing a problem into larger, more complex tasks that can be executed in parallel
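To make the two laws above concrete, here is a minimal numeric sketch in C; the parallel fraction P = 0.9 and the processor counts are arbitrary example values. It prints Amdahl's predicted speedup next to Gustafson's scaled speedup for the same P and N.

```c
#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - P) + P / N) */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

/* Gustafson's Law: scaled speedup = N - (1 - P) * (N - 1) */
static double gustafson(double p, int n) {
    return (double)n - (1.0 - p) * ((double)n - 1.0);
}

int main(void) {
    double p = 0.90;                      /* assume 90% of the work parallelizes */
    int counts[] = {2, 4, 8, 16, 1024};   /* example processor counts */
    int ncounts = (int)(sizeof counts / sizeof counts[0]);

    for (int i = 0; i < ncounts; i++) {
        int n = counts[i];
        printf("N = %4d  Amdahl = %6.2f  Gustafson = %7.2f\n",
               n, amdahl(p, n), gustafson(p, n));
    }
    return 0;
}
```

For P = 0.9, Amdahl's speedup can never exceed 1 / (1 − P) = 10 no matter how many processors are added, while Gustafson's scaled speedup keeps growing with N because the problem size is assumed to grow as well.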
Types of Parallel Systems
Shared memory systems consist of multiple processors that share a common memory space, allowing for efficient communication and data sharing between processors (symmetric multiprocessing systems)
Distributed memory systems comprise multiple independent processing nodes, each with its own local memory, connected through a network or interconnect (clusters, massively parallel processing systems)
Hybrid systems combine shared memory and distributed memory architectures, leveraging the advantages of both approaches (multi-core processors with distributed memory)
SIMD (Single Instruction, Multiple Data) systems execute the same instruction on multiple data elements simultaneously, commonly used in vector processors and GPUs (see the short loop sketch after this list)
MIMD (Multiple Instruction, Multiple Data) systems allow each processor to execute different instructions on different data sets independently, providing flexibility and adaptability to various parallel workloads
Heterogeneous systems incorporate different types of processing units, such as CPUs and GPUs, to leverage their unique strengths for specific tasks (accelerated computing)
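As a minimal sketch of the SIMD idea in C (assuming a compiler that understands OpenMP's simd directive, e.g. GCC or Clang built with -fopenmp-simd), a single logical add is applied element-wise across whole arrays and mapped onto the CPU's vector units:

```c
#include <stdio.h>

#define N 1024

int main(void) {
    float a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    /* One logical operation (an add) applied across all elements;
       the compiler maps the loop onto the CPU's SIMD/vector units. */
    #pragma omp simd
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    printf("c[10] = %.1f\n", c[10]);  /* expect 30.0 */
    return 0;
}
```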
Multiprocessing Architectures
Symmetric Multiprocessing (SMP) systems consist of multiple identical processors that share a common memory space and operating system, allowing for equal access to resources and load balancing
Non-Uniform Memory Access (NUMA) systems extend the SMP architecture by introducing multiple memory nodes, each associated with a subset of processors, resulting in varying memory access latencies depending on the location of the data
Massively Parallel Processing (MPP) systems comprise a large number of independent processing nodes, each with its own memory and interconnect, designed for scalable performance on large-scale problems
Cluster computing involves connecting multiple standalone computers through a high-speed network to form a unified computing resource, enabling cost-effective parallel processing
Multi-core processors integrate multiple processing cores on a single chip, allowing for parallel execution within a single processor package
Many-core processors extend the multi-core concept by incorporating a large number of simpler processing cores, optimized for highly parallel workloads (Intel Xeon Phi, NVIDIA GPUs)
Memory Organization in Parallel Systems
Shared memory organization allows multiple processors to access a common memory space, enabling efficient communication and data sharing between processors
Uniform Memory Access (UMA) provides equal access latency to all memory locations for all processors, simplifying programming and load balancing
Non-Uniform Memory Access (NUMA) introduces multiple memory nodes with varying access latencies, requiring careful data placement and locality optimizations
Distributed memory organization assigns each processor its own local memory, requiring explicit communication and data transfer between processors through message passing or remote memory access
Cache coherence mechanisms ensure data consistency across multiple caches in shared memory systems, preventing stale or inconsistent data (snooping protocols, directory-based protocols)
Memory consistency models define the rules and guarantees for memory operations in parallel systems, specifying the order and visibility of memory accesses (sequential consistency, relaxed consistency models)
Synchronization primitives, such as locks, semaphores, and barriers, are used to coordinate access to shared resources and ensure correct execution order in parallel programs
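A minimal sketch of the lock idea using POSIX threads (the thread count and iteration count are arbitrary example values): several threads increment a shared counter, and a mutex serializes the updates so none are lost.

```c
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* enter critical section */
        counter++;                    /* shared update, now race-free */
        pthread_mutex_unlock(&lock);  /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);

    /* With the mutex, the result is always NTHREADS * NITERS. */
    printf("counter = %ld\n", counter);
    return 0;
}
```

Removing the lock/unlock pair turns the increment into a data race, and the final count typically falls short of NTHREADS * NITERS.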
Parallel Programming Models
Shared memory programming models, such as OpenMP and Pthreads, provide abstractions for parallel execution and synchronization in shared memory systems, using directives, libraries, or language extensions (an OpenMP sketch follows this list)
Message passing programming models, such as MPI (Message Passing Interface), enable parallel programming in distributed memory systems by explicitly defining communication and synchronization between processes
Data parallel programming models, such as OpenCL and CUDA, focus on parallel execution of the same operation across multiple data elements, commonly used in GPGPU (General-Purpose computing on Graphics Processing Units) programming
Task-based programming models, such as Intel Threading Building Blocks (TBB) and Cilk, express parallelism through the decomposition of a problem into tasks or units of work, which are dynamically scheduled and executed by the runtime system
Functional programming languages, such as Erlang and Haskell, provide built-in support for parallelism and concurrency through immutable data structures and side-effect-free functions
Domain-specific languages and frameworks, such as Apache Spark and TensorFlow, offer high-level abstractions and optimizations for parallel processing in specific application domains (big data processing, machine learning)
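As a minimal sketch of the shared memory model mentioned above (OpenMP in C, compiled with an OpenMP-capable compiler such as gcc -fopenmp; the array size is an arbitrary example), a single directive splits a loop across threads and combines their partial sums:

```c
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++) a[i] = 1.0;

    /* The directive splits the loop iterations across the available
       threads; the reduction clause gives each thread a private partial
       sum and combines them at the end. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += a[i];
    }

    printf("sum = %.1f (max threads: %d)\n", sum, omp_get_max_threads());
    return 0;
}
```

An equivalent message passing version would instead launch separate MPI processes, each summing its own slice of the data and combining the results with an explicit collective such as MPI_Reduce.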
Performance Metrics and Optimization
Execution time measures the total time taken to complete a parallel program, including computation, communication, and synchronization overheads
Speedup quantifies the performance improvement achieved by parallel processing compared to sequential execution, calculated as the ratio of sequential execution time to parallel execution time
Efficiency measures the utilization of parallel resources, calculated as the ratio of speedup to the number of processors, indicating the effectiveness of parallelization
Scalability refers to the ability of a parallel system to maintain performance as the problem size and the number of processors increase, often evaluated using metrics such as strong scaling and weak scaling
Load balancing techniques aim to distribute the workload evenly across available processors or nodes, minimizing idle time and maximizing resource utilization (static load balancing, dynamic load balancing); a scheduling sketch follows this list
Data locality optimizations focus on minimizing data movement and maximizing the reuse of data in caches or local memory, reducing communication overheads and improving performance (data partitioning, cache blocking)
Communication optimization techniques, such as message aggregation and overlapping communication with computation, aim to reduce the impact of communication latencies and improve parallel efficiency
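To illustrate the load balancing point, here is a small OpenMP sketch in C (the work() function and the chunk size of 64 are made up for the example): the per-iteration cost grows with the index, so a plain static split would overload the last thread, while schedule(dynamic, 64) lets idle threads grab the next chunk as they finish.

```c
#include <omp.h>
#include <stdio.h>

/* Simulated task whose cost grows with i, so an even static split
   would give the last thread far more work than the first. */
static double work(int i) {
    double x = 0.0;
    for (int k = 0; k < i; k++) x += 1.0 / (k + 1.0);
    return x;
}

int main(void) {
    const int n = 20000;
    double total = 0.0;
    double t0 = omp_get_wtime();

    /* schedule(dynamic, 64): threads grab 64 iterations at a time as
       they become free, which evens out the imbalanced costs. */
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (int i = 0; i < n; i++) {
        total += work(i);
    }

    printf("total = %.2f, elapsed = %.3f s\n", total, omp_get_wtime() - t0);
    return 0;
}
```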
Challenges and Limitations
Amdahl's Law sets a theoretical limit on the speedup achievable through parallelization, because the sequential portion of the program becomes the bottleneck as the number of processors increases
Synchronization overheads, such as lock contention and barrier waiting times, can significantly impact the performance of parallel programs, especially in fine-grained parallelism
Communication overheads, including latency and bandwidth limitations, can limit the scalability of parallel systems, particularly in distributed memory architectures
Load imbalance, where some processors or nodes have more work than others, can lead to underutilization of resources and reduced parallel efficiency
Data dependencies and race conditions can introduce correctness issues in parallel programs, requiring careful synchronization and coordination between parallel tasks (a small race-condition example follows this list)
Debugging and testing parallel programs can be challenging due to the non-deterministic nature of parallel execution and the potential for subtle synchronization bugs (deadlocks, data races)
Portability and performance portability issues arise when moving parallel programs across different architectures or systems, requiring adaptation and optimization for each target platform
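A minimal sketch of the race condition issue in C11 (the counter names and iteration count are arbitrary): the plain counter is updated without synchronization and usually loses increments, while the atomic counter, or the mutex-protected version shown earlier, always reaches the expected total.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NITERS 1000000

static long plain = 0;              /* racy: unsynchronized read-modify-write */
static atomic_long atomic_cnt = 0;  /* safe: atomic read-modify-write */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        plain++;                            /* data race: updates can be lost */
        atomic_fetch_add(&atomic_cnt, 1);   /* no race */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);

    /* 'plain' usually ends up below 2 * NITERS; 'atomic_cnt' is exact. */
    printf("plain = %ld, atomic = %ld (expected %d)\n",
           plain, (long)atomic_load(&atomic_cnt), 2 * NITERS);
    return 0;
}
```

Tools such as ThreadSanitizer flag the unsynchronized increment as a data race, which ties into the debugging difficulties noted above.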
Real-world Applications
Scientific simulations and modeling, such as weather forecasting, molecular dynamics, and computational fluid dynamics, leverage parallel processing to solve complex mathematical problems
Big data analytics and machine learning applications, including data mining, graph processing, and deep learning, utilize parallel computing to process and analyze massive datasets efficiently
Computer graphics and rendering, such as 3D animation, visual effects, and gaming, employ parallel algorithms and GPU acceleration to achieve real-time performance and high-quality visuals
Cryptography and security applications, such as password cracking and encryption, use parallel processing to perform computationally intensive tasks and improve security measures
Bioinformatics and genomics research, including DNA sequencing and protein folding simulations, rely on parallel computing to analyze large biological datasets and accelerate scientific discoveries
Financial modeling and risk analysis, such as Monte Carlo simulations and options pricing, utilize parallel processing to perform complex financial computations and risk assessments
Computer-aided engineering and design, including finite element analysis and computational fluid dynamics, leverage parallel computing to simulate and optimize complex engineering systems and designs