Advanced Computer Architecture Unit 1 – Intro to Advanced Computer Architecture
Advanced Computer Architecture explores the design and optimization of computer systems, focusing on processor design, memory hierarchy, and parallel processing techniques. It delves into instruction set architectures, pipelining, and cache organization to enhance performance and efficiency.
This field examines how to maximize instruction-level parallelism, implement multi-core architectures, and evaluate system performance. It also considers the impact of technology trends on computer design, balancing performance, power consumption, and reliability in modern systems.
Computer architecture encompasses the design, organization, and implementation of computer systems
Focuses on the interface between hardware and software, optimizing performance, power efficiency, and reliability
Includes the study of instruction set architectures (ISAs), which define the basic operations a processor can execute
Explores the organization of processor components such as arithmetic logic units (ALUs), control units, and registers
Examines memory hierarchy, including cache, main memory, and secondary storage, to optimize data access and minimize latency
Investigates parallel processing techniques, such as pipelining and multi-core architectures, to enhance performance
Considers the impact of technology trends, such as Moore's Law and the power wall, on computer architecture design
Processor Design Principles
Processors are designed to execute instructions efficiently and quickly, maximizing performance while minimizing power consumption
RISC (Reduced Instruction Set Computing) architectures emphasize simple, fixed-length instructions, most of which can be executed in a single cycle
Examples of RISC architectures include ARM and MIPS
CISC (Complex Instruction Set Computing) architectures support more complex, variable-length instructions that may require multiple cycles to execute
x86 is a well-known example of a CISC architecture
Pipelining is a technique that overlaps the execution of multiple instructions, allowing the processor to begin executing a new instruction before the previous one has completed
Superscalar architectures enable the execution of multiple instructions simultaneously by duplicating functional units (such as ALUs)
Out-of-order execution allows instructions to be executed in a different order than they appear in the program, based on data dependencies and resource availability
Branch prediction techniques, such as static and dynamic prediction, aim to minimize the impact of control hazards caused by conditional branching instructions
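To make the dynamic branch prediction mentioned above concrete, here is a minimal sketch of a 2-bit saturating-counter predictor in C; the 1024-entry table and the PC-indexing scheme are illustrative assumptions rather than a description of any particular processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal 2-bit saturating-counter branch predictor.
 * States: 0 = strongly not-taken, 1 = weakly not-taken,
 *         2 = weakly taken,       3 = strongly taken.
 * The table size (1024 entries) and PC indexing are illustrative choices. */
#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];     /* all start at 0 (strongly not-taken) */

int predict_taken(uint32_t pc) {
    return counters[pc % TABLE_SIZE] >= 2;    /* predict taken in states 2 and 3 */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;             /* move toward strongly taken */
    if (!taken && *c > 0) (*c)--;             /* move toward strongly not-taken */
}

int main(void) {
    /* A loop branch taken 9 times then not taken once, repeated 100 times:
     * after warm-up the predictor mispredicts only the loop-exit iteration. */
    int mispredicts = 0;
    for (int trip = 0; trip < 100; trip++) {
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);
            if (predict_taken(0x400) != taken) mispredicts++;
            update(0x400, taken);
        }
    }
    printf("mispredictions: %d out of 1000 branches\n", mispredicts);
    return 0;
}
```

The two bits of hysteresis mean a single surprising outcome (such as a loop exit) does not immediately flip the prediction, which is why the example settles at roughly one misprediction per trip around the inner loop.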
Memory Hierarchy and Management
Memory hierarchy organizes storage devices based on their capacity, speed, and cost, with faster and more expensive memory closer to the processor
Registers are the fastest and most expensive memory, located within the processor and used for temporary storage of operands and results
Cache memory is a small, fast memory between the processor and main memory, designed to store frequently accessed data and instructions
Caches are organized into levels (L1, L2, L3) with increasing capacity and latency
Main memory (RAM) is larger and slower than cache, storing the active portions of programs and data
Secondary storage (hard drives, SSDs) has the largest capacity but the slowest access times, used for long-term storage of programs and data
Virtual memory techniques, such as paging and segmentation, allow the operating system to manage memory by providing a logical address space larger than the physical memory
Memory management units (MMUs) translate logical addresses to physical addresses and handle memory protection and allocation
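As a rough illustration of paging, the sketch below splits a virtual address into a virtual page number and an offset and looks the page up in a single-level page table; real MMUs use multi-level tables and TLBs, and the 4 KiB page size and 16-entry table here are assumed parameters chosen only to keep the example small.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy single-level page table with 4 KiB pages (12-bit offset).
 * Real MMUs use multi-level tables plus a TLB; this only shows the
 * split into page number + offset and the translation step. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16                       /* tiny illustrative address space */

typedef struct { int present; uint32_t frame; } pte_t;

static pte_t page_table[NUM_PAGES];

/* Returns 1 and writes the physical address on a hit, 0 on a page fault. */
int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;          /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);      /* offset within the page */
    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return 0;                                   /* page fault */
    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return 1;
}

int main(void) {
    page_table[3].present = 1;
    page_table[3].frame   = 7;            /* virtual page 3 -> physical frame 7 */

    uint32_t paddr;
    if (translate(0x3ABC, &paddr))        /* vpn = 3, offset = 0xABC */
        printf("virtual 0x3ABC -> physical 0x%X\n", paddr);   /* expect 0x7ABC */
    return 0;
}
```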
Instruction-Level Parallelism
Instruction-level parallelism (ILP) refers to the ability to execute multiple instructions simultaneously within a single processor core
ILP can be exploited through techniques such as pipelining, superscalar execution, and out-of-order execution
Data dependencies, such as true dependencies (read-after-write) and anti-dependencies (write-after-read), can limit the amount of ILP that can be achieved; a short sketch at the end of this section illustrates how a dependency chain serializes execution
Instruction scheduling techniques, such as scoreboarding and Tomasulo's algorithm, aim to maximize ILP by reordering instructions based on their dependencies
Register renaming eliminates false dependencies (write-after-write and write-after-read) by mapping architectural registers to a larger set of physical registers
Speculative execution allows the processor to execute instructions before it is certain that they will be needed, based on branch predictions
Very Long Instruction Word (VLIW) architectures explicitly specify the parallelism in the instruction stream, placing the burden of scheduling on the compiler
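The effect of a read-after-write chain on ILP can be seen in the sketch below, which sums an array once with a single accumulator (one long dependency chain) and once with four independent accumulators that a superscalar, out-of-order core can overlap; the four-way split is an arbitrary illustrative choice, and a compiler will normally only reassociate floating-point arithmetic like this when explicitly allowed.

```c
#include <stdio.h>

/* Summing an array two ways.
 * sum_chain: every addition depends on the previous one (a single
 *            read-after-write chain), so the additions cannot overlap.
 * sum_split: four independent accumulators break the chain, exposing
 *            instruction-level parallelism a superscalar core can exploit. */
double sum_chain(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];                        /* each add depends on the previous s */
    return s;
}

double sum_split(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {           /* four independent RAW chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];        /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[1000];
    for (int i = 0; i < 1000; i++) a[i] = i * 0.5;
    printf("%f %f\n", sum_chain(a, 1000), sum_split(a, 1000));
    return 0;
}
```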
Pipelining and Superscalar Architectures
Pipelining divides instruction execution into stages (fetch, decode, execute, memory access, write-back), allowing multiple instructions to be in different stages simultaneously; the sketch at the end of this section visualizes this overlap cycle by cycle
Pipeline hazards, such as structural hazards (resource conflicts), data hazards (dependencies), and control hazards (branches), can stall the pipeline and reduce performance
Forwarding (bypassing) is a technique used to mitigate data hazards by passing results directly between pipeline stages, avoiding the need to wait for them to be written back to registers
Superscalar architectures issue multiple instructions per cycle to multiple functional units, exploiting instruction-level parallelism
Dynamic scheduling techniques, such as Tomasulo's algorithm and the Reorder Buffer (ROB), enable out-of-order execution in superscalar processors
Branch prediction and speculative execution are crucial for maintaining high performance in pipelined and superscalar architectures
Deeply pipelined architectures have a larger number of stages, allowing for higher clock frequencies but increasing the impact of hazards and branch mispredictions
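To visualize the overlap that pipelining provides, the sketch below prints which of the five classic stages each instruction occupies in every cycle of an ideal, hazard-free pipeline; the instruction count is arbitrary and no stalls or forwarding are modeled.

```c
#include <stdio.h>

/* Prints a cycle-by-cycle occupancy chart for an ideal 5-stage pipeline
 * (no hazards, no stalls): instruction i enters fetch in cycle i and
 * finishes write-back in cycle i + 4, so 5 instructions take 9 cycles
 * instead of the 25 a non-pipelined machine would need. */
int main(void) {
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    const int num_stages = 5, num_instrs = 5;
    const int total_cycles = num_instrs + num_stages - 1;

    printf("cycle:");
    for (int c = 0; c < total_cycles; c++) printf("%5d", c + 1);
    printf("\n");

    for (int i = 0; i < num_instrs; i++) {
        printf("  I%d: ", i + 1);
        for (int c = 0; c < total_cycles; c++) {
            int stage = c - i;                 /* stage occupied in this cycle */
            if (stage >= 0 && stage < num_stages)
                printf("%5s", stages[stage]);
            else
                printf("%5s", ".");
        }
        printf("\n");
    }
    return 0;
}
```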
Cache Organization and Optimization
Caches are organized into lines (blocks), each containing multiple words of data or instructions
Cache mapping policies determine how memory addresses are mapped to cache lines (the sketch at the end of this section shows the tag/index/offset split used for the lookup):
Direct-mapped caches map each memory address to a single cache line
Set-associative caches map each memory address to a set of cache lines, allowing for more flexibility and reduced conflicts
Fully-associative caches allow any memory address to be mapped to any cache line, providing the most flexibility but requiring more complex hardware
Cache replacement policies, such as Least Recently Used (LRU) and Random, determine which cache line to evict when a new line needs to be brought in
Write policies determine how writes to the cache are handled:
Write-through caches immediately update both the cache and main memory on a write
Write-back caches update only the cache on a write and mark the line as dirty, writing it back to main memory only when the line is evicted
Cache coherence protocols, such as MESI and MOESI, ensure that multiple copies of data in different caches remain consistent in multi-core and multi-processor systems
Cache optimization techniques, such as prefetching and victim caches, aim to reduce cache misses and improve performance
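As a sketch of how a set-associative lookup decomposes an address, the code below extracts the block offset, set index, and tag for an assumed 32 KiB, 4-way cache with 64-byte lines; the geometry is illustrative and not tied to any real processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Splits a 32-bit address into tag / set index / block offset for an
 * assumed 32 KiB, 4-way set-associative cache with 64-byte lines.
 * 32 KiB / 64 B = 512 lines; 512 lines / 4 ways = 128 sets -> 7 index bits. */
#define LINE_SIZE    64
#define CACHE_SIZE   (32 * 1024)
#define WAYS         4
#define NUM_SETS     (CACHE_SIZE / (LINE_SIZE * WAYS))    /* 128 */
#define OFFSET_BITS  6                                    /* log2(64)  */
#define INDEX_BITS   7                                    /* log2(128) */

int main(void) {
    uint32_t addr = 0x12345678;           /* arbitrary example address */

    uint32_t offset = addr & (LINE_SIZE - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr 0x%08X -> tag 0x%X, set %u, offset %u\n",
           addr, tag, index, offset);
    /* The cache compares this tag against the tags of the 4 lines in the
     * selected set; a match is a hit, otherwise one line is evicted
     * according to the replacement policy (e.g. LRU). */
    return 0;
}
```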
Multi-core and Parallel Processing
Multi-core processors integrate multiple processor cores on a single chip, allowing for thread-level parallelism (TLP)
Symmetric multiprocessing (SMP) architectures provide each core with equal access to shared memory and resources
Non-Uniform Memory Access (NUMA) architectures have memory physically distributed among the cores, with varying access latencies depending on the memory location
Shared memory programming models, such as OpenMP and Pthreads, allow developers to express parallelism using threads that communicate through shared variables (see the OpenMP sketch at the end of this section)
Message passing programming models, such as MPI, enable parallelism by having processes communicate through explicit messages
Synchronization primitives, such as locks, semaphores, and barriers, are used to coordinate access to shared resources and ensure correct parallel execution
Cache coherence and memory consistency models, such as sequential consistency and relaxed consistency, define the allowable orderings of memory operations in parallel systems
Heterogeneous computing architectures combine general-purpose cores with specialized hardware, such as GPUs and FPGAs, for parallel processing of specific workloads
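Since OpenMP is mentioned above as a shared-memory programming model, here is a minimal sketch of a parallel reduction; the array size and contents are arbitrary, and the code assumes an OpenMP-capable compiler (for example, gcc -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

/* Minimal OpenMP shared-memory example: the loop iterations are divided
 * among threads, and the reduction clause gives each thread a private
 * partial sum that is combined at the end, avoiding a data race on the
 * shared variable. Build with: gcc -fopenmp sum.c */
int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("threads available: %d, sum = %.0f\n",
           omp_get_max_threads(), sum);
    return 0;
}
```

The reduction clause is the design point worth noting: each thread accumulates into a private copy of sum, so no lock is needed on the hot path, and the partial sums are combined once at the end.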
Performance Metrics and Evaluation
Performance metrics quantify the efficiency and effectiveness of computer architectures in executing programs
Execution time measures the total time required to complete a program, including computation, memory accesses, and I/O operations
Throughput represents the number of tasks or instructions completed per unit of time (e.g., instructions per second)
Latency refers to the time delay between the initiation of an operation and its completion, such as the time to access memory or execute an instruction
Speedup compares the performance of an optimized or parallel implementation to a baseline, sequential implementation: Speedup = ExecutionTime_sequential / ExecutionTime_parallel
Amdahl's Law describes the maximum speedup achievable through parallelization, based on the fraction of the program that can be parallelized: Speedup ≤ 1 / ((1 − f) + f/N), where f is the parallel fraction and N is the number of processors
Scalability refers to the ability of a system to maintain performance as the problem size or the number of processing elements increases
Benchmarks, such as SPEC CPU and PARSEC, provide standardized workloads and metrics for evaluating and comparing the performance of different computer architectures
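As a quick numeric illustration of Amdahl's Law, the sketch below tabulates the speedup bound for an assumed parallel fraction of f = 0.9 as the processor count grows, showing how the 10% serial portion caps the speedup at 10x no matter how many processors are added.

```c
#include <stdio.h>

/* Amdahl's Law: Speedup(N) <= 1 / ((1 - f) + f / N), where f is the
 * parallelizable fraction and N the number of processors.
 * f = 0.9 is an assumed example value; the 10% serial part caps speedup at 10x. */
int main(void) {
    const double f = 0.9;
    const int counts[] = {1, 2, 4, 8, 16, 64, 256, 1024};

    for (int i = 0; i < 8; i++) {
        int n = counts[i];
        double speedup = 1.0 / ((1.0 - f) + f / n);
        printf("N = %4d  ->  speedup <= %.2f\n", n, speedup);
    }
    return 0;
}
```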