💻 Parallel and Distributed Computing

Key Concepts of Distributed File Systems

Why This Matters

Distributed file systems are the backbone of modern computing infrastructure—from the cloud services you use daily to the massive data pipelines powering machine learning and scientific research. When you're tested on parallel and distributed computing, you're expected to understand not just what these systems are, but why they're architected the way they are. The exam will probe your understanding of fault tolerance, scalability, consistency models, and the fundamental trade-offs every distributed system must navigate.

These file systems illustrate core principles like replication strategies, metadata management, caching mechanisms, and parallel I/O patterns. Each system makes different design choices to optimize for specific workloads—some prioritize throughput for massive batch processing, others emphasize low-latency access for interactive applications. Don't just memorize system names—know what architectural pattern each one demonstrates and when you'd choose one approach over another.


Master-Based Architectures for Big Data

These systems use a centralized master node to manage metadata while distributing actual data across worker nodes. This architecture simplifies consistency but creates a potential single point of failure that must be mitigated through replication and failover mechanisms.

Google File System (GFS)

  • Master-slave architecture—a single master handles all metadata operations while chunkservers store 64MB data chunks across the cluster
  • Replication factor of 3 ensures fault tolerance; chunks are automatically re-replicated when nodes fail (see the sketch after this list)
  • Optimized for append-heavy workloads typical of Google's batch processing, prioritizing throughput over low latency
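
Below is a minimal sketch of the master's role, assuming a hypothetical in-memory chunk table: the client turns a byte offset into a 64 MB chunk index, asks the master which chunkservers hold that chunk, then reads or writes against the chunkservers directly. Names like ChunkMaster and locate are illustrative, not GFS's actual RPC interface.

```python
import random

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB chunks, as in GFS
REPLICATION_FACTOR = 3          # default replica count

class ChunkMaster:
    """Toy stand-in for the GFS master: it holds metadata only, never file data."""

    def __init__(self, chunkservers):
        self.chunkservers = chunkservers
        # (file_name, chunk_index) -> chunkservers holding a replica
        self.chunk_table = {}

    def locate(self, file_name, byte_offset):
        """Map a byte offset to its chunk and return the replica locations."""
        chunk_index = byte_offset // CHUNK_SIZE
        key = (file_name, chunk_index)
        if key not in self.chunk_table:
            # Lazily "create" the chunk by picking three distinct chunkservers.
            self.chunk_table[key] = random.sample(self.chunkservers,
                                                  REPLICATION_FACTOR)
        return chunk_index, self.chunk_table[key]

master = ChunkMaster([f"chunkserver-{i}" for i in range(8)])
idx, replicas = master.locate("/logs/clicks.log", byte_offset=200 * 1024 * 1024)
print(idx, replicas)   # chunk 3 and its three replica locations
```

The client then streams data directly to or from the chunkservers; the master stays out of the data path, which is why a single master can coordinate a very large cluster.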

Hadoop Distributed File System (HDFS)

  • NameNode/DataNode split—the NameNode maintains the filesystem namespace and block mappings in memory for fast lookups
  • Write-once-read-many model simplifies consistency; files cannot be modified after creation, only appended or deleted
  • Data locality optimization moves computation to data rather than moving data to computation, critical for MapReduce efficiency (see the sketch after this list)
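
A hedged sketch of the NameNode idea plus data locality, using an invented in-memory block map and scheduling helper (block_locations and pick_task_node are not Hadoop APIs): the scheduler prefers an idle node that already stores a replica of the block, so the task can read its input locally.

```python
# Toy NameNode-style block map; real HDFS keeps the namespace and block
# mappings in NameNode memory for fast lookups.
block_locations = {
    ("/data/events.csv", 0): ["datanode-1", "datanode-4", "datanode-7"],
    ("/data/events.csv", 1): ["datanode-2", "datanode-4", "datanode-9"],
}

def pick_task_node(path, block_index, idle_nodes):
    """Data locality: prefer an idle node that already holds the block."""
    for node in block_locations.get((path, block_index), []):
        if node in idle_nodes:
            return node                 # computation moves to the data
    return next(iter(idle_nodes))       # fall back: any idle node, remote read

print(pick_task_node("/data/events.csv", 1, {"datanode-4", "datanode-8"}))
# -> datanode-4: a replica lives on that node, so nothing crosses the network
```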

Compare: GFS vs. HDFS—both use master-based architectures with chunk/block replication, but HDFS is an open-source design inspired by GFS and built for the Hadoop ecosystem. If an FRQ asks about big data processing architectures, HDFS is your go-to example since it's more widely documented.


Traditional Network File Sharing

These systems prioritize transparent remote file access over massive scale, using client-server models that make remote files appear local to applications.

Network File System (NFS)

  • Stateless protocol design (in NFSv3) means servers don't track client state, simplifying crash recovery but requiring clients to retry failed operations (a retry sketch follows this list)
  • POSIX-compliant interface allows applications to use standard file operations without modification
  • Best for LAN environments where low latency is achievable; performance degrades significantly over WANs
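
Statelessness is what makes blind retries safe: the server has no per-client session to corrupt, and a READ at a given offset can be resent without side effects. A minimal sketch of that client-side retry pattern, with a made-up nfs_read function standing in for the real READ RPC:

```python
import random
import time

class Timeout(Exception):
    pass

def nfs_read(file_handle, offset, length):
    """Placeholder for an NFS READ RPC; randomly drops the reply to mimic a flaky network."""
    if random.random() < 0.3:
        raise Timeout
    return b"\x00" * length   # dummy payload

def read_with_retry(file_handle, offset, length, attempts=3, backoff=0.5):
    """READ is idempotent, so the client simply resends it after a timeout."""
    for attempt in range(attempts):
        try:
            return nfs_read(file_handle, offset, length)
        except Timeout:
            time.sleep(backoff * (attempt + 1))   # brief linear backoff, then retry
    raise Timeout(f"READ failed after {attempts} attempts")

data = read_with_retry("fh-42", offset=4096, length=8192)
```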

Andrew File System (AFS)

  • Whole-file caching on clients reduces network traffic dramatically—files are cached locally and only updated when the server copy changes
  • Callback mechanism notifies clients when cached files become stale, balancing consistency with performance (sketched after this list)
  • Location transparency through a global namespace means users access files by logical path regardless of physical server location
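
A hedged sketch of whole-file caching with callbacks, using toy FileServer and Client classes rather than the actual AFS protocol: the server remembers a callback promise for every cached copy and breaks it when the file changes, so clients serve reads locally until told otherwise.

```python
class FileServer:
    """Toy AFS-style server: stores file contents and remembers callback promises."""

    def __init__(self):
        self.files = {}       # path -> bytes
        self.callbacks = {}   # path -> set of clients caching that file

    def fetch(self, client, path):
        self.callbacks.setdefault(path, set()).add(client)   # promise a callback
        return self.files[path]

    def store(self, path, data):
        self.files[path] = data
        for client in self.callbacks.pop(path, set()):
            client.break_callback(path)                       # invalidate stale caches


class Client:
    """Toy AFS-style client: caches whole files and reads locally until invalidated."""

    def __init__(self, server):
        self.server = server
        self.cache = {}       # path -> bytes

    def open(self, path):
        if path not in self.cache:                            # miss or broken callback
            self.cache[path] = self.server.fetch(self, path)  # fetch the whole file once
        return self.cache[path]                               # later reads stay local

    def break_callback(self, path):
        self.cache.pop(path, None)                            # next open refetches


server = FileServer()
server.files["/afs/site/doc.txt"] = b"v1"
c = Client(server)
c.open("/afs/site/doc.txt")                # fetched from the server, cached whole
c.open("/afs/site/doc.txt")                # served from the local cache, no network
server.store("/afs/site/doc.txt", b"v2")   # server breaks c's callback
print(c.open("/afs/site/doc.txt"))         # refetched: b"v2"
```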

Compare: NFS vs. AFS—NFS caches at the block level and checks freshness frequently, while AFS caches entire files and uses callbacks. AFS scales better across WANs due to reduced network round-trips, making it the better choice for geographically distributed organizations.


Decentralized and Self-Healing Systems

These modern architectures eliminate single points of failure by distributing metadata across the cluster. They use algorithms like consistent hashing and consensus protocols to maintain coherence without a central coordinator.
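
The mention of consistent hashing above is worth making concrete. A minimal sketch, assuming a simple ring of hashed node IDs: each key is stored on the first node clockwise from its hash, so adding or removing a node only remaps the keys in one arc of the ring instead of reshuffling everything. (This is the flavor of hashing GlusterFS-style placement builds on; Ceph's CRUSH is a more elaborate, hierarchy-aware relative.)

```python
import bisect
import hashlib

def ring_hash(value: str) -> int:
    """Hash a string onto the ring; a 32-bit slice of MD5 is plenty for a demo."""
    return int(hashlib.md5(value.encode()).hexdigest()[:8], 16)

class ConsistentHashRing:
    def __init__(self, nodes):
        # Sort nodes by their position on the ring.
        self.ring = sorted((ring_hash(n), n) for n in nodes)
        self.positions = [pos for pos, _ in self.ring]

    def node_for(self, key):
        """Walk clockwise from hash(key) to the first node on the ring."""
        i = bisect.bisect(self.positions, ring_hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = ConsistentHashRing([f"node-{i}" for i in range(4)])
print(ring.node_for("/vol/home/alice/thesis.tex"))
# Any client can compute this placement on its own; no central lookup table needed.
```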

Ceph File System

  • CRUSH algorithm computes data placement deterministically, eliminating the need for centralized lookup tables
  • Unified storage model provides object, block, and file interfaces from the same underlying cluster
  • Self-healing replication automatically detects and recovers from node failures without administrator intervention

GlusterFS

  • Stackable translator architecture allows flexible composition of features like replication, striping, and caching
  • No metadata server means the filesystem scales linearly—just add more storage nodes without bottlenecks
  • Elastic volume management enables adding or removing storage bricks while the system remains online

Compare: Ceph vs. GlusterFS—both are decentralized and run on commodity hardware, but Ceph uses a more sophisticated placement algorithm (CRUSH) while GlusterFS relies on distributed hash tables. Ceph offers more storage interfaces; GlusterFS is simpler to deploy for pure file storage needs.


High-Performance Computing (HPC) Systems

Designed for scientific computing and analytics workloads, these systems maximize parallel I/O throughput. They use aggressive striping and parallel metadata operations to saturate high-speed interconnects.

Lustre File System

  • Separate metadata and object storage servers (MDS and OSS) allow both to scale independently based on workload characteristics
  • Data striping across OSTs lets a single file draw on the aggregate bandwidth of multiple storage targets (see the striping sketch after this list)
  • Dominant in supercomputing—powers many of the world's fastest systems, handling petabytes of data across thousands of nodes
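
A minimal sketch of round-robin striping, assuming illustrative values (1 MB stripes across 4 OSTs; the OST numbering and helper function are invented): consecutive stripes land on consecutive OSTs, so a large sequential read pulls from all of them in parallel.

```python
STRIPE_SIZE = 1 * 1024 * 1024   # 1 MB stripes
STRIPE_COUNT = 4                # file striped across 4 object storage targets

def locate_stripe(byte_offset):
    """Map a file offset to (OST index, offset within that OST's object)."""
    stripe_index = byte_offset // STRIPE_SIZE
    ost_index = stripe_index % STRIPE_COUNT        # round-robin across OSTs
    stripe_round = stripe_index // STRIPE_COUNT    # full rounds completed so far
    object_offset = stripe_round * STRIPE_SIZE + byte_offset % STRIPE_SIZE
    return ost_index, object_offset

# A 4 MB sequential read touches all four OSTs, so their bandwidths add up:
for offset in range(0, 4 * STRIPE_SIZE, STRIPE_SIZE):
    print(offset, locate_stripe(offset))   # OSTs 0, 1, 2, 3 each serve one stripe
```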

IBM General Parallel File System (GPFS)

  • Distributed metadata across multiple nodes eliminates the single-MDS bottleneck found in Lustre
  • Token-based locking provides strong consistency while allowing parallel access to different file regions (see the sketch after this list)
  • Tiered storage support automatically migrates data between fast and slow storage based on access patterns
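
A hedged sketch of the byte-range idea behind token-based locking (the TokenManager class is a simplification for illustration, not the GPFS token protocol): writers on different nodes proceed in parallel as long as the byte ranges they hold do not overlap, and an overlapping request is refused until the conflicting token is released.

```python
class TokenManager:
    """Toy byte-range token manager: grants a write token only if the range is free."""

    def __init__(self):
        self.granted = []   # list of (node, start, end) write tokens, end exclusive

    def request_write(self, node, start, end):
        for holder, s, e in self.granted:
            if holder != node and start < e and s < end:    # ranges overlap
                return False, f"conflict: {holder} holds [{s}, {e})"
        self.granted.append((node, start, end))
        return True, "granted"

tm = TokenManager()
print(tm.request_write("node-A", 0, 1 << 20))           # granted
print(tm.request_write("node-B", 1 << 20, 2 << 20))     # granted: disjoint range, parallel writes
print(tm.request_write("node-C", 512 << 10, 1 << 20))   # refused: overlaps node-A's token
```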

Compare: Lustre vs. GPFS—both target HPC workloads, but Lustre traditionally centralizes metadata on dedicated metadata servers (simpler but less scalable) while GPFS distributes metadata across the cluster (more complex but eliminates bottlenecks). GPFS offers more enterprise features; Lustre dominates in raw HPC performance benchmarks.


Cloud-Native Managed Services

These fully managed offerings abstract away infrastructure complexity, automatically handling scaling, replication, and failover. They trade some control and customization for operational simplicity and integration with cloud ecosystems.

Amazon Elastic File System (EFS)

  • Automatic elastic scaling grows and shrinks capacity as files are added or removed—no provisioning required
  • NFS v4 protocol ensures compatibility with existing Linux applications without code changes
  • Performance modes let you choose between general-purpose (latency-sensitive) and Max I/O (throughput-optimized) based on workload

Microsoft Azure Files

  • SMB protocol support enables Windows applications and lift-and-shift migrations from on-premises file servers
  • Azure AD integration provides identity-based access control using existing enterprise credentials
  • Hybrid scenarios supported through Azure File Sync, which caches cloud files on on-premises servers

Compare: EFS vs. Azure Files—EFS uses NFS (Linux-native) while Azure Files uses SMB (Windows-native). Both auto-scale and integrate with their respective cloud ecosystems. Choose based on your application's protocol requirements and existing cloud infrastructure.


Quick Reference Table

| Concept | Best Examples |
|---------|---------------|
| Master-based metadata management | GFS, HDFS, Lustre |
| Decentralized / no single point of failure | Ceph, GlusterFS |
| Whole-file caching for WAN efficiency | AFS |
| Block-level caching for LAN access | NFS |
| HPC / parallel I/O optimization | Lustre, GPFS |
| Cloud-managed auto-scaling | EFS, Azure Files |
| Unified object/block/file storage | Ceph |
| Write-once-read-many consistency | HDFS |

Self-Check Questions

  1. Which two distributed file systems eliminate single points of failure through decentralized architectures, and what algorithms do they use for data placement?

  2. Compare the caching strategies of NFS and AFS—which is better suited for a geographically distributed organization, and why?

  3. Both GFS and HDFS use master-based architectures. What consistency model does HDFS implement, and how does this simplify its design?

  4. If you needed to support both Linux (NFS) and Windows (SMB) clients in a cloud environment, which managed services would you consider, and what trade-offs would you evaluate?

  5. An FRQ asks you to design a storage system for a scientific computing cluster processing petabyte-scale simulation data. Which architectural features from Lustre or GPFS would you incorporate, and how do they achieve high parallel throughput?