Distributed systems are collections of independent computers that coordinate over a network and appear to their users as a single coherent system. They are designed to handle large workloads, provide seamless access to shared resources, and keep functioning even when parts of the system fail.
Key characteristics of distributed systems include scalability, transparency, and fault tolerance. These systems can be structured in various ways, such as client-server, peer-to-peer, or hybrid architectures, each with its own strengths and weaknesses for different use cases.
Key Characteristics and Architectural Models
Characteristics of distributed systems
- Scalability enables a system to handle increasing workloads or accommodate growth
  - Horizontal scalability is achieved by adding more nodes to the system (clusters)
  - Vertical scalability is achieved by increasing the capacity of individual nodes (upgrading hardware)
- Transparency hides the complexity and distributed nature of the system from users and developers
  - Access transparency allows local and remote resources to be accessed using the same operations (distributed file systems)
  - Location transparency allows resources to be accessed without knowledge of their physical or network location (content delivery networks)
  - Migration transparency allows resources to be moved without affecting the operation of users or other resources (virtual machine migration)
  - Replication transparency hides the fact that resources are replicated while keeping the replicas consistent (distributed databases)
  - Concurrency transparency allows multiple processes to operate concurrently on shared resources without interference (distributed locking)
  - Failure transparency masks or tolerates faults without impacting the overall system (redundant components)
- Fault tolerance enables the system to continue operating correctly in the presence of faults or failures (see the failover sketch after this list)
  - Redundancy replicates components or data to handle failures (backup servers)
  - Checkpointing periodically saves the state of a process to enable recovery (snapshots)
  - Logging records events and messages to facilitate recovery and debugging (transaction logs)
  - Crash failures occur when a component stops functioning completely (power outage)
  - Omission failures occur when a component fails to respond to some inputs (network packet loss)
  - Timing failures occur when a component responds too early, too late, or not at all (network congestion)
  - Byzantine failures occur when a component exhibits arbitrary or malicious behavior (compromised node)
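As a concrete illustration of redundancy and failure transparency, the sketch below retries a request against a list of replica servers until one responds, masking individual crash and omission failures from the caller. The replica addresses and the raw-bytes wire protocol are hypothetical placeholders, not any specific system's API.

```python
import socket

# Hypothetical replica addresses; in practice these would come from
# configuration or service discovery.
REPLICAS = [("10.0.0.1", 8080), ("10.0.0.2", 8080), ("10.0.0.3", 8080)]

def fetch_with_failover(request: bytes, timeout: float = 2.0) -> bytes:
    """Mask crash and omission failures by trying redundant replicas in turn."""
    last_error = None
    for host, port in REPLICAS:
        try:
            with socket.create_connection((host, port), timeout=timeout) as conn:
                conn.sendall(request)
                return conn.recv(4096)  # first replica to respond wins
        except OSError as exc:          # refused, timed out, reset, ...
            last_error = exc            # record the failure, try the next one
    raise ConnectionError("all replicas failed") from last_error
```

From the caller's point of view the service stays available as long as any one replica is up, which is exactly the transparency the redundancy is buying.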
Architectural models comparison
- Client-server architecture follows a centralized model with distinct roles for clients and servers (see the echo sketch after this list)
  - Clients send requests to servers, which process the requests and send responses back
  - Centralized control and management simplify consistency and synchronization
  - Suitable for systems with well-defined and stable interactions (web applications)
  - Disadvantages: the server is a single point of failure, and server bottlenecks limit scalability
  - Traffic concentrates at the server, which can increase latency compared to decentralized models
- Peer-to-peer (P2P) architecture follows a decentralized model in which each node acts as both a client and a server (see the gossip sketch after this list)
  - Nodes communicate and collaborate directly with each other without a central authority
  - High scalability and fault tolerance due to the distributed nature of the system
  - Efficient resource utilization and load balancing across nodes (file-sharing networks)
  - Reduced network traffic and latency compared to centralized models
  - Complex coordination and synchronization among peers
  - Difficulty ensuring data consistency and integrity across nodes
  - Security and trust issues among peers due to the lack of a central authority
- Hybrid architectures combine client-server and P2P models, leveraging the strengths of both
  - Edge computing processes data close to its source (the edge) to reduce latency and bandwidth usage (IoT devices)
  - Fog computing distributes computation, storage, and networking services between the cloud and edge devices (smart cities)
  - Flexibility and adaptability to different system requirements by combining centralized and decentralized approaches
  - Improved performance and efficiency compared to pure client-server or P2P models
  - Increased complexity in design, implementation, and management
  - Potential compatibility and interoperability issues between components
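To make the client-server model concrete, here is a minimal sketch: a server that accepts requests, processes them, and returns responses, plus a client that performs one request. The port number and echo behavior are arbitrary choices for illustration.

```python
import socket
import threading
import time

def serve(host: str = "127.0.0.1", port: int = 9000) -> None:
    """Server role: accept a request, process it, return a response."""
    with socket.create_server((host, port)) as srv:
        while True:
            conn, _addr = srv.accept()
            with conn:
                data = conn.recv(1024)
                conn.sendall(b"echo: " + data)   # the 'processing' step

def request(message: bytes, host: str = "127.0.0.1", port: int = 9000) -> bytes:
    """Client role: send a request and block until the response arrives."""
    with socket.create_connection((host, port)) as conn:
        conn.sendall(message)
        return conn.recv(1024)

if __name__ == "__main__":
    threading.Thread(target=serve, daemon=True).start()
    time.sleep(0.2)            # give the server a moment to start listening
    print(request(b"hello"))   # b'echo: hello'
```

Note how all coordination lives in one place: that is what makes consistency simple here, and also what makes the server a single point of failure.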
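By contrast, in a peer-to-peer design every node plays both roles. The sketch below simulates a gossip round in a single process: each peer pushes its state to a randomly chosen neighbor, so an update spreads epidemically without any central coordinator. Real P2P systems would do this over the network, with membership management and failure handling.

```python
import random

class Peer:
    """Each peer serves state to neighbors (server role) and pushes state
    to them (client role); there is no central authority."""
    def __init__(self, name: str):
        self.name = name
        self.state: dict[str, int] = {}    # key -> version number

    def receive(self, updates: dict[str, int]) -> None:
        # Server role: merge incoming state, keeping the newest version.
        for key, version in updates.items():
            if version > self.state.get(key, -1):
                self.state[key] = version

    def gossip(self, peers: list["Peer"]) -> None:
        # Client role: push local state to one randomly chosen neighbor.
        neighbor = random.choice([p for p in peers if p is not self])
        neighbor.receive(self.state)

peers = [Peer(f"p{i}") for i in range(5)]
peers[0].state["config"] = 1      # one peer learns an update
for _ in range(10):               # a few gossip rounds spread it epidemically
    for p in peers:
        p.gossip(peers)
print([p.state for p in peers])   # with high probability all now have config=1
```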
Consistency Models and Synchronization
Consistency models in distributed systems
- Strong consistency ensures all nodes always see the same data at the same time
  - Achieved through synchronous replication and strict coordination among nodes (see the quorum sketch after this list)
  - Simplifies application development and reasoning about data consistency
  - Ensures data integrity and prevents conflicts caused by concurrent updates
  - Reduces availability and performance due to synchronization overhead
  - Costly under high write contention or frequent network partitions, but required where correctness is critical (financial transactions)
- Eventual consistency allows nodes to converge on the same data over time, so temporary inconsistencies are possible
  - Achieved through asynchronous replication and optimistic concurrency control
  - High availability and performance, even in the presence of network partitions
  - Suitable for systems with low write contention and high read scalability (social media feeds)
  - Complicates application development, which must handle temporary inconsistencies
  - Potential for conflicts and data loss under concurrent updates
- Causal consistency ensures that causally related operations are seen in the same order by all nodes (see the vector clock sketch after this list)
  - Captures the notion of cause and effect between events in a distributed system
  - Provides a stronger guarantee than eventual consistency while maintaining availability
  - Suitable for systems with causal dependencies and asynchronous communication (collaborative editing)
  - Increases complexity in tracking and enforcing causal relationships among operations
  - Higher overhead than eventual consistency due to metadata management
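One standard mechanism behind strongly consistent replication is quorum intersection: with N replicas, choosing a write quorum W and a read quorum R such that R + W > N ensures every read overlaps the most recent successful write, so reads cannot miss it. A toy in-process sketch follows; the replica class and versioning scheme are illustrative, not any particular database's API.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    version: int
    value: str

class Replica:
    def __init__(self):
        self.store: dict[str, Versioned] = {}

N, W, R = 3, 2, 2   # R + W > N: read and write quorums must intersect

def quorum_write(replicas: list[Replica], key: str, value: str, version: int) -> bool:
    """Acknowledge the write only once W replicas have applied it."""
    acks = 0
    for r in replicas:
        r.store[key] = Versioned(version, value)
        acks += 1
        if acks >= W:
            return True   # write quorum reached; remaining replicas may lag
    return False

def quorum_read(replicas: list[Replica], key: str) -> Versioned | None:
    """Ask R replicas and return the highest-versioned reply."""
    replies = [r.store.get(key) for r in replicas[-R:]]  # any R replicas overlap the write quorum
    replies = [v for v in replies if v is not None]
    return max(replies, key=lambda v: v.version, default=None)

replicas = [Replica() for _ in range(N)]
quorum_write(replicas, "balance", "100", version=1)
print(quorum_read(replicas, "balance"))   # Versioned(version=1, value='100')
```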
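Causal consistency is commonly enforced with causality metadata such as vector clocks: each node keeps one counter per node, ticks its own entry on local events, and takes the elementwise maximum on receive, so comparing two clocks reveals whether one event happened before another. A minimal sketch, with node names and the event API chosen for illustration:

```python
class VectorClock:
    def __init__(self, node: str, nodes: list[str]):
        self.node = node
        self.clock = {n: 0 for n in nodes}

    def local_event(self) -> dict:
        self.clock[self.node] += 1    # tick our own entry
        return dict(self.clock)       # timestamp to attach to the event

    def receive(self, remote: dict) -> None:
        # Merge: elementwise max, then tick our own entry for the receive.
        for n, t in remote.items():
            self.clock[n] = max(self.clock[n], t)
        self.clock[self.node] += 1

def happened_before(a: dict, b: dict) -> bool:
    """True if the event stamped a causally precedes the event stamped b."""
    return all(a[n] <= b[n] for n in a) and a != b

nodes = ["a", "b"]
a, b = VectorClock("a", nodes), VectorClock("b", nodes)
ta = a.local_event()             # a does something and sends it
b.receive(ta)                    # b learns about it
tb = b.local_event()             # b's next event depends on a's
print(happened_before(ta, tb))   # True: replicas must apply ta before tb
```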
Synchronization in distributed systems
- Clock synchronization ensures that all nodes in a distributed system have a consistent view of time
  - Clock drift occurs because clocks on different nodes run at slightly different rates
  - Network delays limit synchronization accuracy, since message latency is variable and unknown
  - Cristian's algorithm synchronizes a client clock against a time server using a request/response exchange (see the offset sketch after this list)
  - The Berkeley algorithm has a coordinator poll the nodes' clocks, average the readings, and send each node an adjustment
  - The Network Time Protocol (NTP) synchronizes clocks over the Internet using a hierarchy of time servers
- Distributed consensus achieves agreement among nodes on a single value or decision
  - Node failures, network partitions, and asynchronous communication make consensus difficult
  - The Paxos family of protocols solves consensus in the presence of failures
    - Basic Paxos reaches consensus on a single value
    - Multi-Paxos extends Basic Paxos to agree on a sequence of values
  - Raft offers a more understandable and implementable alternative to Paxos (see the voting sketch after this list)
    - Leader election: nodes elect a leader to coordinate the protocol
    - Log replication: the leader replicates its log to follower nodes
    - Safety and liveness: Raft keeps replicas consistent while still making progress
  - Byzantine fault-tolerant (BFT) algorithms handle nodes that exhibit arbitrary or malicious behavior
    - Practical Byzantine Fault Tolerance (PBFT) implements state machine replication in Byzantine environments
    - Byzantine Paxos extends Paxos to tolerate Byzantine failures
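Cristian's algorithm, mentioned above, rests on one assumption: network delay is roughly symmetric, so when the server's reply arrives it is about half a round trip old. A minimal sketch against a hypothetical `get_server_time()` RPC, faked locally here so the code runs:

```python
import time

def get_server_time() -> float:
    """Hypothetical RPC to the time server; a real client would call over the network."""
    return time.time()

def cristian_offset() -> float:
    """Estimate the offset between the local clock and the server's clock."""
    t0 = time.monotonic()
    server_time = get_server_time()   # one request/response round trip
    t1 = time.monotonic()
    rtt = t1 - t0
    # Cristian's assumption: delay is symmetric, so the reply is ~rtt/2 old.
    estimated_server_now = server_time + rtt / 2
    return estimated_server_now - time.time()

print(f"estimated clock offset: {cristian_offset():+.6f} s")
```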
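Raft's leader election reduces to a term counter plus a one-vote-per-term rule: a candidate increments its term, requests votes, and becomes leader if a majority grants them. The sketch below shows only the follower-side vote check, and deliberately omits Raft's log up-to-dateness condition, heartbeats, and election timeouts.

```python
from dataclasses import dataclass

@dataclass
class FollowerState:
    current_term: int = 0
    voted_for: str | None = None   # candidate we voted for in current_term

def handle_request_vote(state: FollowerState, candidate: str, term: int) -> bool:
    """Grant at most one vote per term, only for terms at least as new as ours."""
    if term < state.current_term:
        return False                 # stale candidate: reject
    if term > state.current_term:
        state.current_term = term    # newer term: adopt it and clear our vote
        state.voted_for = None
    if state.voted_for in (None, candidate):
        state.voted_for = candidate  # first request this term wins our vote
        return True
    return False                     # already voted for someone else this term

follower = FollowerState()
print(handle_request_vote(follower, "node-a", term=1))   # True
print(handle_request_vote(follower, "node-b", term=1))   # False: vote already cast
```

The one-vote-per-term rule is what guarantees at most one leader per term, which is the safety property the rest of Raft builds on.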