Advanced Computer Architecture Unit 2 – Instruction Parallelism and Pipelining

Instruction parallelism and pipelining are key techniques for boosting processor performance: both let a processor work on several instructions at once, increasing throughput and efficiency.

Pipelining divides instruction execution into stages such as fetch, decode, and execute, then overlaps those stages across instructions in an assembly-line fashion. Challenges such as data dependencies and branch mispredictions must be addressed to keep the pipeline flowing smoothly and realize the potential performance gains.

Fundamentals of Instruction Parallelism

  • Instruction parallelism exploits the potential for instructions to be executed simultaneously
  • Increases the overall throughput and performance of a processor by utilizing multiple functional units
  • Requires the identification of independent instructions that can be executed in parallel without data dependencies
  • Instruction-level parallelism (ILP) is a measure of how many instructions can be executed concurrently in a program
  • Compiler techniques (loop unrolling, software pipelining) and hardware techniques (out-of-order execution, superscalar) are used to exploit ILP
  • Amdahl's Law states that the speedup of a program is limited by the fraction of the program that cannot be parallelized
  • Flynn's Taxonomy classifies computer architectures based on the number of concurrent instruction and data streams (SISD, SIMD, MISD, MIMD)
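Amdahl's Law from the list above can be made concrete with a small sketch (the function name and numbers are illustrative, not from the text):

```python
def amdahl_speedup(parallel_fraction: float, n: int) -> float:
    """Overall speedup when a fraction of the work runs n times faster.

    The serial fraction (1 - parallel_fraction) is the hard limit:
    as n grows, speedup approaches 1 / (1 - parallel_fraction).
    """
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)

# Even with 90% of a program parallelizable across 8 units,
# the speedup stays well below 8.
print(round(amdahl_speedup(0.9, 8), 2))  # → 4.71
```

Pushing `n` higher barely helps here; the 10% serial fraction caps the speedup at 10 no matter how many units are added.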

Pipelining Basics and Concepts

  • Pipelining is a technique that divides the execution of an instruction into multiple stages
  • Enables overlapping execution of multiple instructions, similar to an assembly line
  • Each stage of the pipeline performs a specific task (fetch, decode, execute, memory access, write-back)
  • Pipelining increases overall throughput by reducing the average time between instruction completions (the latency of any single instruction is not reduced)
  • Pipeline registers are used to store intermediate results between pipeline stages
  • Ideal CPI (Cycles Per Instruction) in a pipelined processor approaches 1, indicating that one instruction is completed every clock cycle
  • Pipeline depth refers to the number of stages in the pipeline and affects the clock frequency and instruction latency
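The throughput claim above follows from a standard timing argument: an ideal k-stage pipeline finishes n instructions in k + (n − 1) cycles instead of n × k. A minimal sketch (function names are mine):

```python
def pipelined_cycles(n_instr: int, stages: int) -> int:
    """Cycles for n instructions on an ideal k-stage pipeline:
    k cycles to fill, then one completion per cycle."""
    return stages + (n_instr - 1)

def pipeline_speedup(n_instr: int, stages: int) -> float:
    """Speedup over executing each instruction in k sequential cycles.
    Approaches the pipeline depth for long instruction streams."""
    return (n_instr * stages) / pipelined_cycles(n_instr, stages)

# 1000 instructions on a 5-stage pipeline: 1004 cycles, ~4.98x speedup.
print(pipelined_cycles(1000, 5), round(pipeline_speedup(1000, 5), 2))
```

The speedup never quite reaches the depth k because of the pipeline-fill cycles, and real pipelines fall further short due to stalls.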

Pipeline Hazards and Mitigation Strategies

  • Pipeline hazards are situations that prevent the next instruction from executing during its designated clock cycle
  • Structural hazards occur when hardware resources (memory, register file) are required by multiple instructions simultaneously
    • Mitigated by duplicating hardware resources or using separate instruction and data memories
  • Data hazards arise when an instruction depends on the result of a previous instruction that has not yet completed
    • Mitigated by forwarding (bypassing) results between pipeline stages or stalling the pipeline until the dependency is resolved
  • Control hazards occur when the outcome of a branch instruction is not known, causing subsequent instructions to be fetched incorrectly
    • Mitigated by branch prediction techniques (static, dynamic) and speculative execution
  • Instruction scheduling and compiler optimizations can help reduce pipeline hazards by reordering instructions
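The data-hazard case above can be sketched with a toy RAW-hazard detector; the tuple encoding and the two-instruction hazard window are my assumptions for a classic 5-stage pipeline without forwarding, not anything from the text:

```python
# Each instruction is (dest_register, tuple_of_source_registers).
def find_raw_hazards(instrs):
    """Return (producer, consumer) index pairs with a RAW dependency
    close enough to need forwarding or a stall."""
    hazards = []
    for i, (dest, _) in enumerate(instrs):
        # Without forwarding, a result written in WB is not visible to
        # the next two instructions' register reads (assumed window).
        for j in range(i + 1, min(i + 3, len(instrs))):
            _, srcs = instrs[j]
            if dest in srcs:
                hazards.append((i, j))
    return hazards

program = [
    ("r1", ("r2", "r3")),   # add r1, r2, r3
    ("r4", ("r1", "r5")),   # add r4, r1, r5  <- reads r1 immediately
    ("r6", ("r1", "r4")),   # add r6, r1, r4  <- reads r1 and r4
]
print(find_raw_hazards(program))
```

Each reported pair is a spot where the hardware must either forward the result between stages or stall the consumer, exactly the mitigations listed above.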

Superscalar and VLIW Architectures

  • Superscalar architectures issue multiple instructions per clock cycle from a single instruction stream
  • Dynamically identify independent instructions and dispatch them to multiple functional units
  • Require complex hardware for instruction scheduling, dependency checking, and out-of-order execution
  • Very Long Instruction Word (VLIW) architectures use a fixed-length instruction format with multiple operation fields
  • VLIW instructions specify multiple independent operations that can be executed in parallel
  • Rely on the compiler to perform instruction scheduling and dependency analysis statically
  • Tradeoffs between hardware complexity (superscalar) and compiler complexity (VLIW)
  • Examples of superscalar processors: Intel Core, AMD Ryzen; VLIW processors: Itanium, TI C6000
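A superscalar issue unit's dependency check can be illustrated with a toy dual-issue test; the encoding and the restriction to RAW/WAW checks are simplifying assumptions of mine (real issue logic also checks structural resources):

```python
def can_dual_issue(i1, i2) -> bool:
    """Can two instructions (dest, srcs) issue in the same cycle?

    Disallow a RAW hazard (second reads the first's result) and a
    WAW hazard (both write the same register). WAR is safe here if
    registers are read before being written within the cycle.
    """
    d1, _ = i1
    d2, s2 = i2
    return d1 not in s2 and d1 != d2

print(can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r5", "r6"))))  # independent
print(can_dual_issue(("r1", ("r2", "r3")), ("r4", ("r1", "r6"))))  # RAW blocks
```

A VLIW machine performs no such check at runtime: the compiler must only pack operations into a long word when this test would already pass, which is the hardware/compiler tradeoff noted above.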

Branch Prediction and Speculative Execution

  • Branch prediction techniques aim to minimize the impact of control hazards in pipelined processors
  • Static branch prediction uses fixed heuristics (always taken, always not taken) based on instruction type or branch direction
  • Dynamic branch prediction uses runtime information (branch history table, two-level adaptive predictors) to make predictions
  • Branch target buffer (BTB) caches the target addresses of recently executed branches to avoid pipeline stalls
  • Speculative execution allows the processor to fetch and execute instructions along the predicted path before the branch outcome is known
  • If the branch prediction is incorrect, the speculatively executed instructions are discarded, and the pipeline is flushed
  • Branch prediction accuracy is crucial for maintaining high performance in pipelined processors
  • Advanced branch prediction techniques (neural branch prediction, perceptron-based predictors) improve prediction accuracy
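The classic dynamic predictor behind many branch history tables is a 2-bit saturating counter per branch; a minimal sketch (the dictionary-keyed table and default state are my simplifications):

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter.
    States 0-1 predict not-taken, 2-3 predict taken; two consecutive
    mispredictions are needed to flip a strongly-held prediction."""

    def __init__(self):
        self.counters = {}  # branch address -> counter state (0..3)

    def predict(self, pc) -> bool:
        return self.counters.get(pc, 1) >= 2  # default: weakly not-taken

    def update(self, pc, taken: bool):
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)

p = TwoBitPredictor()
hits = 0
for taken in [True, True, True, True]:   # e.g. a loop branch
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
print(hits)  # mispredicts only the first iteration
```

Once saturated at "strongly taken", a single loop-exit misprediction does not flip the prediction, which is why this scheme outperforms a 1-bit predictor on loops.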

Memory System Support for Pipelining

  • Memory hierarchy design plays a crucial role in supporting pipelined execution
  • Caches (L1, L2, L3) provide fast access to frequently used instructions and data, reducing memory access latency
  • Cache miss penalties can stall the pipeline, degrading performance
  • Techniques like cache prefetching, non-blocking caches, and out-of-order memory accesses help hide cache miss latency
  • Memory disambiguation techniques (load-store queues, memory dependence prediction) resolve memory dependencies in out-of-order execution
  • Instruction cache and data cache are often separated to avoid conflicts and improve bandwidth
  • Memory consistency models (sequential consistency, weak ordering) define the ordering constraints for memory operations in pipelined systems
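The cost of cache misses to a pipeline is usually summarized as average memory access time (AMAT = hit time + miss rate × miss penalty); a small sketch with illustrative numbers of my choosing:

```python
def amat(hit_time: float, miss_rate: float, miss_penalty: float) -> float:
    """Average Memory Access Time in cycles; a blocking pipeline
    stalls for roughly this long on every memory access."""
    return hit_time + miss_rate * miss_penalty

# Two-level hierarchy: the L1 miss penalty is itself the L2's AMAT.
l2 = amat(hit_time=10, miss_rate=0.2, miss_penalty=100)  # 30.0 cycles
l1 = amat(hit_time=1, miss_rate=0.05, miss_penalty=l2)   # 2.5 cycles
print(l1)
```

Techniques like prefetching and non-blocking caches attack the miss-rate and miss-penalty terms respectively, which is why they are listed above as ways to hide miss latency.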

Performance Analysis of Pipelined Systems

  • Performance metrics for pipelined processors include throughput, latency, and speedup
  • Throughput represents the number of instructions completed per unit time (instructions per cycle, IPC)
  • Latency is the time taken to complete a single instruction from start to finish
  • Speedup is the ratio of the execution time of a non-pipelined processor to that of a pipelined processor
  • Pipeline stalls and hazards impact the actual performance, reducing the achieved IPC below the ideal value
  • Average instruction execution time is affected by the pipeline depth, stage delays, and stall cycles
  • Amdahl's Law limits the overall speedup achievable through pipelining based on the fraction of non-pipelined execution
  • Performance analysis tools (hardware counters, simulation, profiling) help identify bottlenecks and optimize pipelined systems
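The metrics above combine in a standard way: stalls add to the ideal CPI, and speedup over a non-pipelined machine of equal stage delay is roughly depth divided by the effective CPI. A sketch (function names and numbers are mine):

```python
def effective_cpi(ideal_cpi: float, stall_cycles_per_instr: float) -> float:
    """Actual CPI once hazard and memory stalls are counted."""
    return ideal_cpi + stall_cycles_per_instr

def speedup_with_stalls(depth: int, ideal_cpi: float, stalls: float) -> float:
    """Approximate speedup of a k-stage pipeline over unpipelined
    execution, ignoring pipeline-register overhead."""
    return depth / effective_cpi(ideal_cpi, stalls)

# A 5-stage pipeline with ideal CPI 1 and 0.25 stall cycles per
# instruction achieves IPC of 0.8 and a speedup of 4, not 5.
print(effective_cpi(1.0, 0.25), speedup_with_stalls(5, 1.0, 0.25))
```

This is the quantitative version of the bullet above: stalls pull the achieved IPC below 1 and the realized speedup below the pipeline depth.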

Advanced Pipelining Techniques

  • Superpipelining increases the number of pipeline stages to achieve higher clock frequencies
  • Simultaneous multithreading (SMT) lets multiple threads issue instructions in the same cycle, sharing a single pipeline's functional units
  • Speculative multithreading speculatively executes multiple threads in parallel, exploiting thread-level parallelism
  • Trace caches store decoded instructions in the order of program execution, reducing decode and fetch latencies
  • Decoupled architectures separate the instruction fetch and execution units, allowing them to operate independently
  • Dataflow architectures execute instructions based on data availability rather than sequential program order
  • Reconfigurable architectures (FPGAs) allow the hardware to be customized for specific application requirements
  • Future trends in pipelining include adaptive pipeline depths, dynamic resource allocation, and hybrid architectures combining different parallelism techniques


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
