💻 Parallel and Distributed Computing

Key Concepts of Distributed File Systems

Why This Matters

Distributed file systems are the backbone of modern computing infrastructure, from cloud services to the massive data pipelines powering machine learning and scientific research. When you're tested on parallel and distributed computing, you need to understand not just what these systems are, but why they're architected the way they are. Exams will probe your understanding of fault tolerance, scalability, consistency models, and the fundamental trade-offs every distributed system must navigate.

These file systems illustrate core principles like replication strategies, metadata management, caching mechanisms, and parallel I/O patterns. Each system makes different design choices to optimize for specific workloads: some prioritize throughput for massive batch processing, others emphasize low-latency access for interactive applications. Don't just memorize system names. Know what architectural pattern each one demonstrates and when you'd choose one approach over another.


Master-Based Architectures for Big Data

These systems use a centralized master node to manage metadata while distributing actual data across worker nodes. This architecture simplifies consistency but creates a potential single point of failure that must be mitigated through replication and failover mechanisms.

Google File System (GFS)

  • Master-slave architecture: a single master handles all metadata operations (namespace, chunk locations, access control) while chunkservers store 64 MB data chunks across the cluster
  • Replication factor of 3 ensures fault tolerance; chunks are automatically re-replicated when the master detects node failures via heartbeat messages
  • Optimized for append-heavy workloads typical of Google's batch processing, prioritizing high sustained throughput over low latency. GFS introduced a relaxed consistency model where concurrent appends are guaranteed to succeed at least once but may produce duplicates or padding, which the application layer is expected to handle.
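The heartbeat-driven re-replication described above can be sketched as a toy master. This is an illustrative simplification, not GFS internals: class and method names are hypothetical, and real placement also weighs rack topology and disk utilization.

```python
import random

REPLICATION_FACTOR = 3  # GFS's default replication level


class Master:
    """Toy metadata master: tracks which chunkservers hold each chunk
    and restores the replication factor when a server misses heartbeats."""

    def __init__(self, chunkservers):
        self.live = set(chunkservers)
        self.placements = {}  # chunk_id -> set of chunkservers holding a replica

    def create_chunk(self, chunk_id):
        # Place the new chunk on REPLICATION_FACTOR distinct live servers.
        self.placements[chunk_id] = set(
            random.sample(sorted(self.live), REPLICATION_FACTOR))

    def handle_missed_heartbeat(self, server):
        # Server presumed dead: drop it and re-replicate every chunk it held.
        self.live.discard(server)
        for replicas in self.placements.values():
            replicas.discard(server)
            while len(replicas) < REPLICATION_FACTOR:
                candidates = self.live - replicas
                replicas.add(random.choice(sorted(candidates)))
```

The key invariant is that after any single failure, every chunk is back at three replicas, all on live servers, without any client involvement.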

Hadoop Distributed File System (HDFS)

  • NameNode/DataNode split: the NameNode maintains the entire filesystem namespace and block-to-DataNode mappings in memory for fast lookups. Because this metadata lives in RAM, the NameNode's memory capacity limits the total number of files the cluster can store.
  • Write-once-read-many (WORM) model simplifies consistency. Once a file is closed, its existing blocks cannot be modified (only appended to or deleted). This avoids complex concurrent-write coordination.
  • Data locality optimization moves computation to data rather than moving data to computation. The JobTracker (or YARN ResourceManager) schedules map tasks on nodes that already hold the relevant blocks, which is critical for MapReduce efficiency.
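The data-locality rule in the last bullet reduces to a simple preference order when scheduling a task. A minimal sketch (function and return labels are illustrative; real schedulers also consider rack-local placement as a middle tier):

```python
def schedule_map_task(block_locations, free_nodes):
    """Prefer a free node that already stores a replica of the input block
    (node-local read); otherwise fall back to any free node (remote read).

    block_locations: set of node names holding replicas of the block.
    free_nodes: list of nodes with spare task slots, in priority order.
    """
    local = [n for n in free_nodes if n in block_locations]
    if local:
        return local[0], "node-local"
    return free_nodes[0], "remote"
```

If `n2` holds a replica and is free, the task runs there and reads from local disk; only when no replica holder is free does the framework pay the network cost of a remote read.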

Compare: GFS vs. HDFS: both use master-based architectures with chunk/block replication, but HDFS is an open-source system modeled on GFS's published design and built for the Hadoop ecosystem. GFS uses 64 MB chunks; HDFS defaults to 128 MB blocks (raised from 64 MB in later versions) to reduce NameNode memory pressure. If a question asks about big data processing architectures, HDFS is your go-to example since it's more widely documented.


Traditional Network File Sharing

These systems prioritize transparent remote file access over massive scale, using client-server models that make remote files appear local to applications.

Network File System (NFS)

  • Stateless protocol design (in NFSv3) means servers don't track which clients have which files open. This simplifies crash recovery because the server can restart without reconstructing session state, but clients must retry failed operations themselves.
  • POSIX-compliant interface allows applications to use standard file operations (open, read, write, close) without modification.
  • Best for LAN environments where low latency is achievable. Performance degrades significantly over WANs because NFS clients frequently poll the server to check whether cached data is still fresh.

Note: NFSv4 moved to a stateful design with features like delegations and compound RPCs, addressing some of the scalability limitations of v3. Be aware of which version a question references.
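The polling behavior that hurts NFSv3 over WANs can be sketched as a client-side attribute cache: cached data is trusted for a short TTL, then revalidated with a GETATTR-style check. Class name, TTL value, and the callback signatures are all hypothetical simplifications.

```python
import time


class NFSAttrCache:
    """Sketch of NFSv3-style attribute caching: trust cached data within a
    short TTL, then re-poll the server's modification time to revalidate."""

    def __init__(self, ttl=3.0):
        self.ttl = ttl
        self.cache = {}  # filename -> (mtime_seen, data, validated_at)

    def read(self, filename, server_getattr, server_read, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(filename)
        if entry:
            mtime_seen, data, validated_at = entry
            if now - validated_at < self.ttl:
                return data  # trust the cache within the TTL, no network traffic
            if server_getattr(filename) == mtime_seen:
                self.cache[filename] = (mtime_seen, data, now)
                return data  # revalidated with one cheap attribute check
        mtime = server_getattr(filename)
        data = server_read(filename)  # full fetch on miss or staleness
        self.cache[filename] = (mtime, data, now)
        return data
```

Every revalidation is a server round-trip, which is negligible on a LAN but dominates latency over a WAN; that is exactly the cost AFS's callbacks avoid.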

Andrew File System (AFS)

  • Whole-file caching on the client's local disk reduces network traffic dramatically. When a client opens a file, the entire file is transferred and cached locally. All reads and writes happen against this local copy.
  • Callback mechanism keeps caches consistent. The server issues a callback promise to each client that caches a file. When any client writes back a modified copy, the server breaks callbacks on all other clients, notifying them their cached copy is stale.
  • Location transparency through a global namespace (/afs/...) means users access files by logical path regardless of which physical server stores them.
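The callback mechanism can be sketched as a small amount of server-side state: a set of clients holding each file, invalidated when anyone writes back. Class and method names are invented for illustration; a real AFS server sends the callback breaks over RPC.

```python
class AFSServer:
    """Toy callback tracking: the server remembers which clients cache each
    file and 'breaks' those callbacks when a modified copy is stored."""

    def __init__(self):
        self.callbacks = {}  # filename -> set of clients with a valid cached copy

    def fetch(self, filename, client):
        # Whole-file fetch: register a callback promise for this client.
        self.callbacks.setdefault(filename, set()).add(client)

    def store(self, filename, writer):
        # Write-back on file close: break callbacks on all *other* clients.
        stale = self.callbacks.get(filename, set()) - {writer}
        self.callbacks[filename] = {writer}  # only the writer's cache stays fresh
        return stale  # these clients must re-fetch on their next open
```

Note the inversion relative to NFS: clients never poll; they assume their cache is valid until the server tells them otherwise, which is why AFS generates far fewer WAN round-trips.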

Compare: NFS vs. AFS: NFS caches at the block level and checks freshness frequently (attribute caching with short timeouts), while AFS caches entire files and uses server-initiated callbacks. AFS scales better across WANs due to reduced network round-trips, making it the better choice for geographically distributed organizations. The trade-off is that AFS has higher latency on first access (whole-file transfer) and uses session semantics where changes only become visible to other clients when the file is closed.


Decentralized and Self-Healing Systems

These modern architectures eliminate single points of failure by distributing metadata across the cluster. They use algorithms like consistent hashing and consensus protocols to maintain coherence without a central coordinator.

Ceph File System

  • CRUSH algorithm (Controlled Replication Under Scalable Hashing) computes data placement deterministically from the object name and a cluster map. Any node can calculate where data lives without consulting a central lookup table, which removes a major scalability bottleneck.
  • Unified storage model provides object (RADOS), block (RBD), and file (CephFS) interfaces from the same underlying cluster, so you don't need separate systems for different storage needs.
  • Self-healing replication automatically detects node failures through peer monitoring and re-replicates affected data to restore the target redundancy level without administrator intervention.
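CRUSH's key property, that any node can compute placement from the object name and cluster map alone, can be demonstrated with a rendezvous-hashing sketch. Real CRUSH walks a weighted failure-domain hierarchy; this flat stand-in only shows the deterministic, lookup-free placement idea.

```python
import hashlib


def place(object_name, osds, replicas=3):
    """Deterministically choose `replicas` storage daemons for an object.

    Every participant hashes (object, osd) pairs and picks the top scorers,
    so identical inputs yield identical placements everywhere, with no
    central lookup table to consult.
    """
    def score(osd):
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(digest, 16)

    return sorted(osds, key=score, reverse=True)[:replicas]
```

Because placement is a pure function of the name and the member list, a client and a recovering storage daemon independently agree on where every object lives.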

GlusterFS

  • Stackable translator architecture allows flexible composition of features like replication, striping, distribution, and caching as modular layers. You configure a volume by stacking the translators you need.
  • No dedicated metadata server: metadata is stored alongside data using an elastic hashing algorithm to map filenames to storage locations. This means the filesystem scales linearly as you add more storage nodes, with no metadata bottleneck.
  • Elastic volume management enables adding or removing storage bricks (individual storage exports from a node) while the system remains online, rebalancing data in the background.
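The "no metadata server" claim rests on the same idea in a different form: each brick owns a slice of a hash space, and the brick for a file is computed from its name. A rough sketch, using MD5 as a stand-in for GlusterFS's actual hash function and a flat global hash space instead of per-directory layouts:

```python
import hashlib


def pick_brick(filename, bricks):
    """Map a filename into the 32-bit hash space and return the brick whose
    slice contains it; no metadata server is consulted."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % (1 << 32)
    slice_size = (1 << 32) // len(bricks)
    index = min(h // slice_size, len(bricks) - 1)  # clamp into the last slice
    return bricks[index]
```

Adding a brick changes the slice boundaries, which is why online rebalancing (the third bullet) must migrate files whose hashes now fall in a different slice.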

Compare: Ceph vs. GlusterFS: both are decentralized and run on commodity hardware, but Ceph uses the more sophisticated CRUSH algorithm for placement while GlusterFS relies on a distributed hash table approach. Ceph offers more storage interfaces (object, block, and file); GlusterFS is simpler to deploy for pure file storage needs. Ceph also separates its monitor daemons (which maintain cluster state via Paxos consensus) from its storage daemons, giving it stronger consistency guarantees at the cost of more operational complexity.


High-Performance Computing (HPC) Systems

Designed for scientific computing and analytics workloads, these systems maximize parallel I/O throughput. They use aggressive striping and parallel metadata operations to saturate high-speed interconnects.

Lustre File System

  • Separate metadata and object storage servers (MDS and OSS) allow both to scale independently. You can add more OSSes for bandwidth or more MDSes for metadata-heavy workloads (recent Lustre versions support distributed metadata via DNE).
  • Data striping across OSTs (Object Storage Targets) splits single files across multiple storage targets. A single large file can achieve the aggregate read/write bandwidth of all the OSTs it's striped across, which is how Lustre delivers the I/O rates HPC applications demand.
  • Dominant in supercomputing: Lustre powers many of the world's fastest systems, handling petabytes of data across thousands of nodes.
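Round-robin striping is just modular arithmetic on the file offset, which is worth being able to do by hand on an exam. A sketch (parameter names are illustrative; in Lustre these come from the file's layout attributes):

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Map a file byte offset to (OST index, byte offset within that OST's
    object) under round-robin striping.

    Stripe k of the file lands on OST k % stripe_count; each OST's object
    holds every stripe_count-th stripe back to back.
    """
    stripe_index = offset // stripe_size            # which stripe of the file
    ost = stripe_index % stripe_count               # round-robin over the OSTs
    stripes_on_ost = stripe_index // stripe_count   # earlier stripes on that OST
    return ost, stripes_on_ost * stripe_size + offset % stripe_size
```

With a 1 MiB stripe over 4 OSTs, a sequential read of a large file pulls from all four targets concurrently, which is where the aggregate-bandwidth claim above comes from.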

IBM General Parallel File System (GPFS / Spectrum Scale)

  • Distributed metadata across multiple nodes eliminates the single-MDS bottleneck. Any node in the cluster can serve metadata requests, which improves both throughput and resilience.
  • Token-based distributed locking provides byte-range locking with strong consistency. Multiple nodes can write to different regions of the same file simultaneously without conflicts, and the token manager coordinates access only when ranges overlap.
  • Tiered storage support automatically migrates data between fast storage (SSD) and slower storage (HDD/tape) based on access patterns and administrator-defined policies.
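The byte-range locking rule reduces to an interval-overlap test: the token manager only needs to coordinate writers whose ranges intersect. A minimal sketch using half-open `[start, end)` ranges (the interval convention here is an assumption, not GPFS's wire format):

```python
def ranges_conflict(a, b):
    """Return True if two byte-range write tokens overlap and therefore
    require coordination; disjoint ranges proceed fully in parallel."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end
```

Two nodes writing `[0, 100)` and `[100, 200)` of the same file never contact the token manager after their initial grants, which is how GPFS sustains concurrent writers on one file.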

Compare: Lustre vs. GPFS: both target HPC workloads, but Lustre historically used centralized metadata (simpler but less scalable for metadata-heavy workloads) while GPFS distributes metadata from the start. GPFS offers more enterprise features like snapshots, encryption, and policy-driven tiering. Lustre tends to dominate in raw sequential I/O benchmarks and is open-source, making it the default choice for many national labs and supercomputing centers.


Cloud-Native Managed Services

These fully managed offerings abstract away infrastructure complexity, automatically handling scaling, replication, and failover. They trade some control and customization for operational simplicity and tight integration with cloud ecosystems.

Amazon Elastic File System (EFS)

  • Automatic elastic scaling grows and shrinks storage capacity as files are added or removed, with no provisioning required. You pay only for what you store.
  • NFSv4.1 protocol ensures compatibility with existing Linux applications without code changes.
  • Performance modes let you choose between General Purpose (optimized for latency-sensitive operations like web serving) and Max I/O (optimized for aggregate throughput with thousands of concurrent clients). There are also throughput modes (Bursting vs. Provisioned) that control how much bandwidth you get.

Microsoft Azure Files

  • SMB protocol support (and NFSv4.1 for Linux workloads) enables Windows applications and lift-and-shift migrations from on-premises file servers.
  • Azure AD integration provides identity-based access control using existing enterprise credentials, which simplifies permission management for organizations already using Azure Active Directory.
  • Hybrid scenarios supported through Azure File Sync, which caches frequently accessed cloud files on on-premises servers while tiering cold data to the cloud.

Compare: EFS vs. Azure Files: EFS uses NFS (Linux-native) while Azure Files primarily uses SMB (Windows-native), though Azure Files now also supports NFS. Both auto-scale and integrate with their respective cloud ecosystems. Choose based on your application's protocol requirements, your existing cloud infrastructure, and whether you need hybrid on-premises caching (Azure File Sync has no direct EFS equivalent).


Quick Reference Table

Concept                                     | Best Examples
--------------------------------------------|-------------------
Master-based metadata management            | GFS, HDFS, Lustre
Decentralized / no single point of failure  | Ceph, GlusterFS
Whole-file caching for WAN efficiency       | AFS
Block-level caching for LAN access          | NFS
HPC / parallel I/O optimization             | Lustre, GPFS
Cloud-managed auto-scaling                  | EFS, Azure Files
Unified object/block/file storage           | Ceph
Write-once-read-many consistency            | HDFS

Self-Check Questions

  1. Which two distributed file systems eliminate single points of failure through decentralized architectures, and what algorithms do they use for data placement?

  2. Compare the caching strategies of NFS and AFS. Which is better suited for a geographically distributed organization, and why?

  3. Both GFS and HDFS use master-based architectures. What consistency model does HDFS implement, and how does this simplify its design?

  4. If you needed to support both Linux (NFS) and Windows (SMB) clients in a cloud environment, which managed services would you consider, and what trade-offs would you evaluate?

  5. You're asked to design a storage system for a scientific computing cluster processing petabyte-scale simulation data. Which architectural features from Lustre or GPFS would you incorporate, and how do they achieve high parallel throughput?