
Operating Systems Unit 7 – Distributed Systems

Distributed systems are networks of autonomous computers working together as a unified system. They offer scalability, fault tolerance, and resource sharing by connecting multiple nodes through networks, enabling parallel processing and workload distribution across locations. Key concepts include scalability, fault tolerance, transparency, and decentralization. Various architectures like client-server, peer-to-peer, and event-driven are used. Communication methods involve message passing, remote procedure calls, and publish-subscribe models, while synchronization and consistency mechanisms ensure proper coordination.

What Are Distributed Systems?

  • Distributed systems are computing systems where multiple autonomous computers work together to appear as a single coherent system to end-users
  • Consist of multiple nodes (computers or servers) connected through a network, collaborating to achieve a common goal or perform a specific task
  • Each node has its own local memory and operates concurrently, communicating with other nodes through message passing
  • Distributed systems aim to provide scalability, fault tolerance, resource sharing, and transparency to users
  • Can be geographically dispersed across different locations (data centers, offices, or even continents)
  • Enable parallel processing and distribution of workload across multiple nodes to improve performance and efficiency
  • Examples of distributed systems include cloud computing platforms (Amazon Web Services), peer-to-peer networks (BitTorrent), and distributed databases (Apache Cassandra)

Key Concepts and Principles

  • Scalability: Distributed systems can scale horizontally by adding more nodes to handle increased workload and accommodate growth
  • Fault Tolerance: Designed to continue functioning correctly even in the presence of failures (node crashes or network partitions)
  • Transparency: Hide the complexity of distribution from users, providing a seamless and unified view of the system
    • Location transparency: Users are unaware of the physical location of resources or services
    • Replication transparency: Users are unaware of the existence of replicated data or services
  • Concurrency: Multiple processes or threads execute simultaneously on different nodes, requiring coordination and synchronization mechanisms
  • Heterogeneity: Distributed systems can incorporate diverse hardware, software, and network technologies, enabling interoperability and integration
  • Decentralization: No single point of control or failure; decision-making and data management are distributed among nodes
  • Openness: Distributed systems often adhere to open standards and protocols, facilitating interoperability and extensibility

Distributed System Architectures

  • Client-Server Architecture: Clients send requests to servers, which process the requests and send back responses
    • Servers provide services or resources, while clients consume those services
    • Examples include web applications (client browsers interacting with web servers) and email systems (client email applications communicating with email servers)
  • Peer-to-Peer (P2P) Architecture: Nodes (peers) have equal roles and responsibilities, acting as both clients and servers
    • Peers directly communicate and share resources with each other without relying on a central server
    • Examples include file-sharing networks (Gnutella) and distributed computing projects (SETI@home)
  • Layered Architecture: Distributed system is organized into hierarchical layers, each layer providing services to the layer above and using services from the layer below
  • Object-Based Architecture: System is composed of interacting objects, each encapsulating data and behavior
    • Objects communicate through remote method invocations or message passing
    • Examples include distributed object middleware (CORBA) and actor-based systems (Akka)
  • Event-Driven Architecture: Components communicate and coordinate through the production and consumption of events
    • Events are typically asynchronous and can trigger actions or state changes in other components
    • Enables loose coupling and scalability, as components can be added or removed without affecting the entire system
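The loose coupling of an event-driven architecture can be sketched with a minimal in-process event bus. This is an illustrative toy, not a real middleware API: the `EventBus` class, the `"order_placed"` event type, and the handlers are all hypothetical names chosen for the example.

```python
from collections import defaultdict

class EventBus:
    """Minimal in-process event bus: components subscribe to event
    types and react when matching events are published."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, handler):
        self._subscribers[event_type].append(handler)

    def publish(self, event_type, payload):
        # The producer does not know (or care) which components consume
        # the event -- components can be added or removed independently.
        for handler in self._subscribers[event_type]:
            handler(payload)

# Two independent components react to the same event.
bus = EventBus()
log = []
bus.subscribe("order_placed", lambda p: log.append(("audit", p)))
bus.subscribe("order_placed", lambda p: log.append(("ship", p)))
bus.publish("order_placed", {"id": 42})
```

In a distributed setting the bus would be replaced by a broker such as Kafka, but the decoupling idea is the same: the publisher never references its consumers.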

Communication in Distributed Systems

  • Message Passing: Nodes communicate by sending and receiving messages over a network
    • Messages can be simple data structures or complex objects serialized for transmission
    • Communication can be synchronous (blocking) or asynchronous (non-blocking)
  • Remote Procedure Call (RPC): Allows a program to call procedures or methods on remote nodes as if they were local
    • RPC frameworks handle the marshalling and unmarshalling of parameters and return values
    • Examples include gRPC and Apache Thrift
  • Publish-Subscribe: Nodes publish messages to topics or channels, and other nodes subscribe to receive messages of interest
    • Enables decoupled communication, as publishers and subscribers do not need to be aware of each other
    • Examples include Apache Kafka and MQTT (originally MQ Telemetry Transport)
  • Multicast: Sending a single message to multiple recipients simultaneously
    • Efficient for group communication and disseminating information to a subset of nodes
  • Gossip Protocols: Nodes periodically exchange information with a subset of randomly selected peers
    • Information propagates through the network in a decentralized and fault-tolerant manner
    • Used for disseminating updates, detecting failures, and maintaining consistency
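The gossip idea above can be simulated in a few lines: each node periodically pushes what it knows to a few randomly chosen peers, and updates spread epidemically until every node has them. This is a self-contained sketch (node state modeled as a set of update IDs, a fixed fanout of 2), not a production protocol.

```python
import random

def gossip_round(states, fanout=2):
    """One round of push gossip: every node sends its known updates to
    `fanout` randomly selected peers, which merge them into their own state."""
    n = len(states)
    for i in range(n):
        peers = random.sample([k for k in range(n) if k != i], fanout)
        for j in peers:
            states[j] |= states[i]  # peer merges the sender's updates

# Node 0 learns an update; gossip spreads it to all 8 nodes.
random.seed(1)
states = [set() for _ in range(8)]
states[0].add("update-1")
rounds = 0
while not all("update-1" in s for s in states):
    gossip_round(states)
    rounds += 1
```

Convergence typically takes O(log n) rounds in expectation, which is why gossip scales well for failure detection and anti-entropy in systems like Cassandra.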

Synchronization and Coordination

  • Distributed Locks: Mechanism to ensure exclusive access to shared resources across multiple nodes
    • Prevents race conditions and maintains data integrity
    • Examples include distributed lock managers (Apache ZooKeeper) and distributed key-value stores with locking support (etcd)
  • Distributed Transactions: Ensure atomicity, consistency, isolation, and durability (ACID) properties across multiple nodes
    • Two-Phase Commit (2PC): Coordinator node manages the transaction, ensuring all participants agree to commit or abort
    • Three-Phase Commit (3PC): Adds a pre-commit phase so participants can make progress if the coordinator fails, though it remains vulnerable to network partitions
  • Consensus Algorithms: Enable multiple nodes to agree on a single value or state in the presence of failures and network partitions
    • Examples include Paxos, Raft, and Byzantine Fault Tolerance (BFT) algorithms
    • Used for leader election, state machine replication, and distributed decision-making
  • Vector Clocks: Mechanism for tracking causality and ordering events in a distributed system
    • Each node maintains a vector of logical clocks, incrementing its own clock and merging clocks from other nodes
    • Helps detect and resolve conflicts in distributed data updates
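The vector clock rules described above (tick your own entry on a local event, take the element-wise maximum on receive, compare element-wise to detect causality) can be sketched directly. The function names are illustrative; clocks are modeled as plain dicts mapping node IDs to counters.

```python
def vc_increment(clock, node):
    """Local event or send: tick this node's own component."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def vc_merge(local, received):
    """On receive: merge by taking the element-wise maximum
    (the receiver would then tick its own entry as well)."""
    merged = dict(local)
    for node, t in received.items():
        merged[node] = max(merged.get(node, 0), t)
    return merged

def vc_happens_before(a, b):
    """a -> b iff every component of a is <= b's and a != b.
    If neither a -> b nor b -> a, the events are concurrent."""
    keys = set(a) | set(b)
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

# A's event is delivered to B, so it causally precedes B's merged state;
# C's independent event is concurrent with A's.
a = vc_increment({}, "A")            # {"A": 1}
b = vc_merge({"B": 1}, a)            # {"A": 1, "B": 1}
c = vc_increment({}, "C")            # {"C": 1}
```

Two clocks where neither happens-before the other signal a concurrent conflict, which is what Dynamo-style stores surface to the application for resolution.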

Consistency and Replication

  • Eventual Consistency: Replicas eventually converge to the same state, but there may be temporary inconsistencies
    • Suitable for systems where strong consistency is not critical and temporary inconsistencies are acceptable
    • Examples include DNS (Domain Name System) and NoSQL databases with eventual consistency models
  • Strong Consistency: All replicas always have the same view of the data, and updates are immediately visible to all nodes
    • Achieved through synchronous replication and strict ordering of operations
    • Examples include distributed databases with strong consistency guarantees (Google Spanner)
  • Quorum-Based Consistency: Reads and writes are performed on a subset (quorum) of replicas to ensure consistency
    • Quorum sizes are chosen to balance availability and consistency requirements
    • Examples include Dynamo-style databases (Apache Cassandra) and distributed file systems (Ceph)
  • Consistency Models: Define the guarantees and trade-offs between consistency, availability, and partition tolerance
    • CAP Theorem: A distributed system cannot simultaneously guarantee Consistency, Availability, and Partition Tolerance; when a network partition occurs, it must sacrifice either consistency or availability
    • PACELC: Extends CAP by considering the trade-offs between latency and consistency during normal operation (absence of partitions)
  • Replication Strategies: Techniques for maintaining multiple copies of data across nodes
    • Master-Slave Replication: One node (master) handles writes, and changes are propagated to slave nodes
    • Multi-Master Replication: Multiple nodes can accept writes, and changes are synchronized among them
    • Peer-to-Peer Replication: Each node maintains a copy of the data and synchronizes with other nodes
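The quorum rule mentioned above is a simple arithmetic check: with N replicas, a read quorum of R and a write quorum of W overlap whenever R + W > N, so every read sees at least one replica holding the latest write (and W + W > N prevents two conflicting writes from both succeeding). A minimal sketch of that check, with a hypothetical helper name:

```python
def quorums_overlap(n, r, w):
    """True when read/write quorums guarantee overlap: R + W > N
    ensures every read quorum intersects the last write quorum,
    and 2W > N prevents two concurrent conflicting write quorums."""
    return r + w > n and 2 * w > n

# Common Dynamo-style setting: N=3, R=2, W=2 -> quorums overlap.
assert quorums_overlap(3, 2, 2)
# R=1, W=1 with N=3 gives fast operations but only eventual consistency.
assert not quorums_overlap(3, 1, 1)
```

Systems like Cassandra expose exactly this trade-off as tunable consistency levels (e.g. ONE vs QUORUM) per read and write.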

Fault Tolerance and Reliability

  • Redundancy: Replicating components, data, or services to handle failures and ensure availability
    • Active Replication: All replicas actively process requests and maintain the same state
    • Passive Replication: One replica (primary) handles requests, and others (backups) are kept in sync and take over if the primary fails
  • Checkpointing and Recovery: Periodically saving the state of a node or system to enable recovery from failures
    • Checkpoints serve as a snapshot of the system state at a particular point in time
    • Recovery involves restoring the system state from the most recent checkpoint and replaying any subsequent operations
  • Failure Detection: Mechanisms to detect and report node or component failures in a distributed system
    • Heartbeat Messages: Nodes periodically send "I am alive" messages to indicate their availability
    • Gossip-Based Failure Detection: Nodes exchange information about the status of other nodes they have communicated with
  • Fault-Tolerant Algorithms: Designed to continue functioning correctly in the presence of failures
    • Byzantine Fault Tolerance (BFT): Tolerates arbitrary failures, including malicious or compromised nodes
    • Paxos and Raft: Consensus algorithms that ensure agreement among nodes even in the presence of failures
  • Self-Healing and Automatic Recovery: Distributed systems can automatically detect and recover from failures without manual intervention
    • Examples include automatic failover mechanisms, self-healing clusters, and self-stabilizing algorithms
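Heartbeat-based failure detection, described above, reduces to tracking the last time each node was heard from and suspecting any node silent longer than a timeout. A minimal sketch, assuming a simple fixed timeout (real detectors such as the phi-accrual detector adapt the threshold to observed network delays); the class and method names are hypothetical:

```python
import time

class HeartbeatDetector:
    """Timeout-based failure detector: a node is suspected of having
    failed if no heartbeat arrived within `timeout` seconds."""
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}  # node id -> timestamp of last heartbeat

    def heartbeat(self, node, now=None):
        """Record an 'I am alive' message from `node`."""
        self.last_seen[node] = time.monotonic() if now is None else now

    def suspected(self, node, now=None):
        """A never-seen or long-silent node is suspected as failed."""
        now = time.monotonic() if now is None else now
        last = self.last_seen.get(node)
        return last is None or now - last > self.timeout

# Timestamps passed explicitly here to make the example deterministic.
d = HeartbeatDetector(timeout=5.0)
d.heartbeat("node-1", now=100.0)
alive = not d.suspected("node-1", now=103.0)  # heard from recently
failed = d.suspected("node-1", now=110.0)     # missed heartbeats
```

Note the inherent ambiguity: a suspected node may merely be slow or partitioned, which is why consensus protocols treat suspicion as a trigger for re-election rather than proof of death.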

Real-World Applications and Examples

  • Distributed Databases: Store and manage data across multiple nodes for scalability, fault tolerance, and high availability
    • Examples include Apache Cassandra, Google Spanner, and Amazon DynamoDB
  • Distributed File Systems: Provide a unified view of files and directories across multiple nodes
    • Examples include Hadoop Distributed File System (HDFS), Google File System (GFS), and Ceph
  • Distributed Caching: Cache frequently accessed data across multiple nodes to improve performance and reduce load on backend systems
    • Examples include Memcached, Redis, and Apache Ignite
  • Distributed Computing Frameworks: Enable parallel processing of large-scale data and computations across a cluster of nodes
    • Examples include Apache Hadoop, Apache Spark, and Dryad
  • Blockchain and Cryptocurrencies: Decentralized systems that maintain a distributed ledger of transactions without a central authority
    • Examples include Bitcoin, Ethereum, and Hyperledger Fabric
  • Content Delivery Networks (CDNs): Distribute content (images, videos, etc.) across geographically dispersed nodes to improve performance and availability
    • Examples include Akamai, Cloudflare, and Amazon CloudFront
  • Distributed Messaging Systems: Enable reliable and scalable message passing and event-driven communication between distributed components
    • Examples include Apache Kafka, RabbitMQ, and ZeroMQ


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
