Advanced Computer Architecture Unit 1 – Intro to Advanced Computer Architecture
Advanced Computer Architecture explores the design and optimization of computer systems, focusing on processor design, memory hierarchy, and parallel processing techniques. It delves into instruction set architectures, pipelining, and cache organization to enhance performance and efficiency.
This field examines how to maximize instruction-level parallelism, implement multi-core architectures, and evaluate system performance. It also considers the impact of technology trends on computer design, balancing performance, power consumption, and reliability in modern systems.
Computer architecture encompasses the design, organization, and implementation of computer systems
Focuses on the interface between hardware and software, optimizing performance, power efficiency, and reliability
Includes the study of instruction set architectures (ISAs), which define the basic operations a processor can execute
Explores the organization of processor components such as arithmetic logic units (ALUs), control units, and registers
Examines memory hierarchy, including cache, main memory, and secondary storage, to optimize data access and minimize latency
Investigates parallel processing techniques, such as pipelining and multi-core architectures, to enhance performance
Considers the impact of technology trends, such as Moore's Law and the power wall, on computer architecture design
Processor Design Principles
Processors are designed to execute instructions efficiently and quickly, maximizing performance while minimizing power consumption
RISC (Reduced Instruction Set Computing) architectures emphasize simple, fixed-length instructions, most of which can be executed in a single cycle
Examples of RISC architectures include ARM and MIPS
CISC (Complex Instruction Set Computing) architectures support more complex, variable-length instructions that may require multiple cycles to execute
x86 is a well-known example of a CISC architecture
Pipelining is a technique that overlaps the execution of multiple instructions, allowing the processor to begin executing a new instruction before the previous one has completed
Superscalar architectures enable the execution of multiple instructions simultaneously by duplicating functional units (such as ALUs)
Out-of-order execution allows instructions to be executed in a different order than they appear in the program, based on data dependencies and resource availability
Branch prediction techniques, such as static and dynamic prediction, aim to minimize the impact of control hazards caused by conditional branching instructions
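To make the dynamic branch prediction mentioned above concrete, here is a minimal sketch of a 2-bit saturating-counter predictor in C; the 1024-entry table and the PC-indexing scheme are illustrative assumptions rather than a description of any particular processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Minimal 2-bit saturating-counter branch predictor.
 * States: 0 = strongly not-taken, 1 = weakly not-taken,
 *         2 = weakly taken,       3 = strongly taken.
 * The table size (1024 entries) and PC indexing are illustrative choices. */
#define TABLE_SIZE 1024
static uint8_t counters[TABLE_SIZE];     /* all start at 0 (strongly not-taken) */

int predict_taken(uint32_t pc) {
    return counters[pc % TABLE_SIZE] >= 2;    /* predict taken in states 2 and 3 */
}

void update(uint32_t pc, int taken) {
    uint8_t *c = &counters[pc % TABLE_SIZE];
    if (taken  && *c < 3) (*c)++;             /* move toward strongly taken */
    if (!taken && *c > 0) (*c)--;             /* move toward strongly not-taken */
}

int main(void) {
    /* A loop branch taken 9 times then not taken once, repeated 100 times:
     * after warm-up the predictor mispredicts only the loop-exit iteration. */
    int mispredicts = 0;
    for (int trip = 0; trip < 100; trip++) {
        for (int i = 0; i < 10; i++) {
            int taken = (i < 9);
            if (predict_taken(0x400) != taken) mispredicts++;
            update(0x400, taken);
        }
    }
    printf("mispredictions: %d out of 1000 branches\n", mispredicts);
    return 0;
}
```

The two bits of hysteresis mean a single surprising outcome (such as a loop exit) does not immediately flip the prediction, which is why the example settles at roughly one misprediction per trip around the inner loop.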
Memory Hierarchy and Management
Memory hierarchy organizes storage devices based on their capacity, speed, and cost, with faster and more expensive memory closer to the processor
Registers are the fastest and most expensive memory, located within the processor and used for temporary storage of operands and results
Cache memory is a small, fast memory between the processor and main memory, designed to store frequently accessed data and instructions
Caches are organized into levels (L1, L2, L3) with increasing capacity and latency
Main memory (RAM) is larger and slower than cache, storing the active portions of programs and data
Secondary storage (hard drives, SSDs) has the largest capacity but the slowest access times, used for long-term storage of programs and data
Virtual memory techniques, such as paging and segmentation, allow the operating system to manage memory by providing a logical address space larger than the physical memory
Memory management units (MMUs) translate logical addresses to physical addresses and handle memory protection and allocation
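As a rough illustration of paging, the sketch below splits a virtual address into a virtual page number and an offset and looks the page up in a single-level page table; real MMUs use multi-level tables and TLBs, and the 4 KiB page size and 16-entry table here are assumed parameters chosen only to keep the example small.

```c
#include <stdint.h>
#include <stdio.h>

/* Toy single-level page table with 4 KiB pages (12-bit offset).
 * Real MMUs use multi-level tables plus a TLB; this only shows the
 * split into page number + offset and the translation step. */
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1u << PAGE_SHIFT)
#define NUM_PAGES  16                       /* tiny illustrative address space */

typedef struct { int present; uint32_t frame; } pte_t;

static pte_t page_table[NUM_PAGES];

/* Returns 1 and writes the physical address on a hit, 0 on a page fault. */
int translate(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn    = vaddr >> PAGE_SHIFT;          /* virtual page number */
    uint32_t offset = vaddr & (PAGE_SIZE - 1);      /* offset within the page */
    if (vpn >= NUM_PAGES || !page_table[vpn].present)
        return 0;                                   /* page fault */
    *paddr = (page_table[vpn].frame << PAGE_SHIFT) | offset;
    return 1;
}

int main(void) {
    page_table[3].present = 1;
    page_table[3].frame   = 7;            /* virtual page 3 -> physical frame 7 */

    uint32_t paddr;
    if (translate(0x3ABC, &paddr))        /* vpn = 3, offset = 0xABC */
        printf("virtual 0x3ABC -> physical 0x%X\n", paddr);   /* expect 0x7ABC */
    return 0;
}
```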
Instruction-Level Parallelism
Instruction-level parallelism (ILP) refers to the ability to execute multiple instructions simultaneously within a single processor core
ILP can be exploited through techniques such as pipelining, superscalar execution, and out-of-order execution
Data dependencies, such as true dependencies (read-after-write) and anti-dependencies (write-after-read), can limit the amount of ILP that can be achieved; a short sketch at the end of this section illustrates how a dependency chain serializes execution
Instruction scheduling techniques, such as scoreboarding and Tomasulo's algorithm, aim to maximize ILP by reordering instructions based on their dependencies
Register renaming eliminates false dependencies (write-after-write and write-after-read) by mapping architectural registers to a larger set of physical registers
Speculative execution allows the processor to execute instructions before it is certain that they will be needed, based on branch predictions
Very Long Instruction Word (VLIW) architectures explicitly specify the parallelism in the instruction stream, placing the burden of scheduling on the compiler
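The effect of a read-after-write chain on ILP can be seen in the sketch below, which sums an array once with a single accumulator (one long dependency chain) and once with four independent accumulators that a superscalar, out-of-order core can overlap; the four-way split is an arbitrary illustrative choice, and a compiler will normally only reassociate floating-point arithmetic like this when explicitly allowed.

```c
#include <stdio.h>

/* Summing an array two ways.
 * sum_chain: every addition depends on the previous one (a single
 *            read-after-write chain), so the additions cannot overlap.
 * sum_split: four independent accumulators break the chain, exposing
 *            instruction-level parallelism a superscalar core can exploit. */
double sum_chain(const double *a, int n) {
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i];                        /* each add depends on the previous s */
    return s;
}

double sum_split(const double *a, int n) {
    double s0 = 0, s1 = 0, s2 = 0, s3 = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {           /* four independent RAW chains */
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++) s0 += a[i];        /* leftover elements */
    return (s0 + s1) + (s2 + s3);
}

int main(void) {
    double a[1000];
    for (int i = 0; i < 1000; i++) a[i] = i * 0.5;
    printf("%f %f\n", sum_chain(a, 1000), sum_split(a, 1000));
    return 0;
}
```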
Pipelining and Superscalar Architectures
Pipelining divides instruction execution into stages (fetch, decode, execute, memory access, write-back), allowing multiple instructions to be in different stages simultaneously; the sketch at the end of this section visualizes this overlap cycle by cycle
Pipeline hazards, such as structural hazards (resource conflicts), data hazards (dependencies), and control hazards (branches), can stall the pipeline and reduce performance
Forwarding (bypassing) is a technique used to mitigate data hazards by passing results directly between pipeline stages, avoiding the need to wait for them to be written back to registers
Superscalar architectures issue multiple instructions per cycle to multiple functional units, exploiting instruction-level parallelism
Dynamic scheduling techniques, such as Tomasulo's algorithm and the Reorder Buffer (ROB), enable out-of-order execution in superscalar processors
Branch prediction and speculative execution are crucial for maintaining high performance in pipelined and superscalar architectures
Deeply pipelined architectures have a larger number of stages, allowing for higher clock frequencies but increasing the impact of hazards and branch mispredictions
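To visualize the overlap that pipelining provides, the sketch below prints which of the five classic stages each instruction occupies in every cycle of an ideal, hazard-free pipeline; the instruction count is arbitrary and no stalls or forwarding are modeled.

```c
#include <stdio.h>

/* Prints a cycle-by-cycle occupancy chart for an ideal 5-stage pipeline
 * (no hazards, no stalls): instruction i enters fetch in cycle i and
 * finishes write-back in cycle i + 4, so 5 instructions take 9 cycles
 * instead of the 25 a non-pipelined machine would need. */
int main(void) {
    const char *stages[] = {"IF", "ID", "EX", "MEM", "WB"};
    const int num_stages = 5, num_instrs = 5;
    const int total_cycles = num_instrs + num_stages - 1;

    printf("cycle:");
    for (int c = 0; c < total_cycles; c++) printf("%5d", c + 1);
    printf("\n");

    for (int i = 0; i < num_instrs; i++) {
        printf("  I%d: ", i + 1);
        for (int c = 0; c < total_cycles; c++) {
            int stage = c - i;                 /* stage occupied in this cycle */
            if (stage >= 0 && stage < num_stages)
                printf("%5s", stages[stage]);
            else
                printf("%5s", ".");
        }
        printf("\n");
    }
    return 0;
}
```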
Cache Organization and Optimization
Caches are organized into lines (blocks), each containing multiple words of data or instructions
Cache mapping policies determine how memory addresses are mapped to cache lines (the sketch at the end of this section shows the tag/index/offset split used for the lookup):
Direct-mapped caches map each memory address to a single cache line
Set-associative caches map each memory address to a set of cache lines, allowing for more flexibility and reduced conflicts
Fully-associative caches allow any memory address to be mapped to any cache line, providing the most flexibility but requiring more complex hardware
Cache replacement policies, such as Least Recently Used (LRU) and Random, determine which cache line to evict when a new line needs to be brought in
Write policies determine how writes to the cache are handled:
Write-through caches immediately update both the cache and main memory on a write
Write-back caches update only the cache on a write and mark the line as dirty, writing it back to main memory only when the line is evicted
Cache coherence protocols, such as MESI and MOESI, ensure that multiple copies of data in different caches remain consistent in multi-core and multi-processor systems
Cache optimization techniques, such as prefetching and victim caches, aim to reduce cache misses and improve performance
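As a sketch of how a set-associative lookup decomposes an address, the code below extracts the block offset, set index, and tag for an assumed 32 KiB, 4-way cache with 64-byte lines; the geometry is illustrative and not tied to any real processor.

```c
#include <stdint.h>
#include <stdio.h>

/* Splits a 32-bit address into tag / set index / block offset for an
 * assumed 32 KiB, 4-way set-associative cache with 64-byte lines.
 * 32 KiB / 64 B = 512 lines; 512 lines / 4 ways = 128 sets -> 7 index bits. */
#define LINE_SIZE    64
#define CACHE_SIZE   (32 * 1024)
#define WAYS         4
#define NUM_SETS     (CACHE_SIZE / (LINE_SIZE * WAYS))    /* 128 */
#define OFFSET_BITS  6                                    /* log2(64)  */
#define INDEX_BITS   7                                    /* log2(128) */

int main(void) {
    uint32_t addr = 0x12345678;           /* arbitrary example address */

    uint32_t offset = addr & (LINE_SIZE - 1);
    uint32_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);
    uint32_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);

    printf("addr 0x%08X -> tag 0x%X, set %u, offset %u\n",
           addr, tag, index, offset);
    /* The cache compares this tag against the tags of the 4 lines in the
     * selected set; a match is a hit, otherwise one line is evicted
     * according to the replacement policy (e.g. LRU). */
    return 0;
}
```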
Multi-core and Parallel Processing
Multi-core processors integrate multiple processor cores on a single chip, allowing for thread-level parallelism (TLP)
Symmetric multiprocessing (SMP) architectures provide each core with equal access to shared memory and resources
Non-Uniform Memory Access (NUMA) architectures have memory physically distributed among the cores, with varying access latencies depending on the memory location
Shared memory programming models, such as OpenMP and Pthreads, allow developers to express parallelism using threads that communicate through shared variables (see the OpenMP sketch at the end of this section)
Message passing programming models, such as MPI, enable parallelism by having processes communicate through explicit messages
Synchronization primitives, such as locks, semaphores, and barriers, are used to coordinate access to shared resources and ensure correct parallel execution
Cache coherence and memory consistency models, such as sequential consistency and relaxed consistency, define the allowable orderings of memory operations in parallel systems
Heterogeneous computing architectures combine general-purpose cores with specialized hardware, such as GPUs and FPGAs, for parallel processing of specific workloads
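Since OpenMP is mentioned above as a shared-memory programming model, here is a minimal sketch of a parallel reduction; the array size and contents are arbitrary, and the code assumes an OpenMP-capable compiler (for example, gcc -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

/* Minimal OpenMP shared-memory example: the loop iterations are divided
 * among threads, and the reduction clause gives each thread a private
 * partial sum that is combined at the end, avoiding a data race on the
 * shared variable. Build with: gcc -fopenmp sum.c */
int main(void) {
    enum { N = 1000000 };
    static double a[N];
    for (int i = 0; i < N; i++) a[i] = 1.0;

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("threads available: %d, sum = %.0f\n",
           omp_get_max_threads(), sum);
    return 0;
}
```

The reduction clause is the design point worth noting: each thread accumulates into a private copy of sum, so no lock is needed on the hot path, and the partial sums are combined once at the end.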
Performance Metrics and Evaluation
Performance metrics quantify the efficiency and effectiveness of computer architectures in executing programs
Execution time measures the total time required to complete a program, including computation, memory accesses, and I/O operations
Throughput represents the number of tasks or instructions completed per unit of time (e.g., instructions per second)
Latency refers to the time delay between the initiation of an operation and its completion, such as the time to access memory or execute an instruction
Speedup compares the performance of an optimized or parallel implementation to a baseline, sequential implementation: Speedup = ExecutionTime_sequential / ExecutionTime_parallel
Amdahl's Law describes the maximum speedup achievable through parallelization, based on the fraction of the program that can be parallelized: Speedup ≤ 1 / ((1 − f) + f/N), where f is the parallel fraction and N is the number of processors
Scalability refers to the ability of a system to maintain performance as the problem size or the number of processing elements increases
Benchmarks, such as SPEC CPU and PARSEC, provide standardized workloads and metrics for evaluating and comparing the performance of different computer architectures
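As a quick numeric illustration of Amdahl's Law, the sketch below tabulates the speedup bound for an assumed parallel fraction of f = 0.9 as the processor count grows, showing how the 10% serial portion caps the speedup at 10x no matter how many processors are added.

```c
#include <stdio.h>

/* Amdahl's Law: Speedup(N) <= 1 / ((1 - f) + f / N), where f is the
 * parallelizable fraction and N the number of processors.
 * f = 0.9 is an assumed example value; the 10% serial part caps speedup at 10x. */
int main(void) {
    const double f = 0.9;
    const int counts[] = {1, 2, 4, 8, 16, 64, 256, 1024};

    for (int i = 0; i < 8; i++) {
        int n = counts[i];
        double speedup = 1.0 / ((1.0 - f) + f / n);
        printf("N = %4d  ->  speedup <= %.2f\n", n, speedup);
    }
    return 0;
}
```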