💻 Parallel and Distributed Computing

Key Concepts of Distributed File Systems

Why This Matters

Distributed file systems are the backbone of modern computing infrastructure, from cloud services to the massive data pipelines powering machine learning and scientific research. When you're tested on parallel and distributed computing, you need to understand not just what these systems are, but why they're architected the way they are. Exams will probe your understanding of fault tolerance, scalability, consistency models, and the fundamental trade-offs every distributed system must navigate.

These file systems illustrate core principles like replication strategies, metadata management, caching mechanisms, and parallel I/O patterns. Each system makes different design choices to optimize for specific workloads: some prioritize throughput for massive batch processing, others emphasize low-latency access for interactive applications. Don't just memorize system names. Know what architectural pattern each one demonstrates and when you'd choose one approach over another.


Master-Based Architectures for Big Data

These systems use a centralized master node to manage metadata while distributing actual data across worker nodes. This architecture simplifies consistency but creates a potential single point of failure that must be mitigated through replication and failover mechanisms.

Google File System (GFS)

  • Master-slave architecture: a single master handles all metadata operations (namespace, chunk locations, access control) while chunkservers store 64 MB data chunks across the cluster
  • Replication factor of 3 ensures fault tolerance; chunks are automatically re-replicated when the master detects node failures via heartbeat messages
  • Optimized for append-heavy workloads typical of Google's batch processing, prioritizing high sustained throughput over low latency. GFS introduced a relaxed consistency model where concurrent appends are guaranteed to succeed at least once but may produce duplicates or padding, which the application layer is expected to handle.
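The heartbeat-driven re-replication described above can be sketched as a toy master. This is an illustrative simplification, not GFS internals: class and method names are hypothetical, and real placement also weighs rack topology and disk utilization.

```python
import random

REPLICATION_FACTOR = 3  # GFS's default replication level


class Master:
    """Toy metadata master: tracks which chunkservers hold each chunk
    and restores the replication factor when a server misses heartbeats."""

    def __init__(self, chunkservers):
        self.live = set(chunkservers)
        self.placements = {}  # chunk_id -> set of chunkservers holding a replica

    def create_chunk(self, chunk_id):
        # Place the new chunk on REPLICATION_FACTOR distinct live servers.
        self.placements[chunk_id] = set(
            random.sample(sorted(self.live), REPLICATION_FACTOR))

    def handle_missed_heartbeat(self, server):
        # Server presumed dead: drop it and re-replicate every chunk it held.
        self.live.discard(server)
        for replicas in self.placements.values():
            replicas.discard(server)
            while len(replicas) < REPLICATION_FACTOR:
                candidates = self.live - replicas
                replicas.add(random.choice(sorted(candidates)))
```

The key invariant is that after any single failure, every chunk is back at three replicas, all on live servers, without any client involvement.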

Hadoop Distributed File System (HDFS)

  • NameNode/DataNode split: the NameNode maintains the entire filesystem namespace and block-to-DataNode mappings in memory for fast lookups. Because this metadata lives in RAM, the NameNode's memory capacity limits the total number of files the cluster can store.
  • Write-once-read-many (WORM) model simplifies consistency. Once a file is closed, its existing blocks cannot be modified (only appended to or deleted). This avoids complex concurrent-write coordination.
  • Data locality optimization moves computation to data rather than moving data to computation. The JobTracker (or YARN ResourceManager) schedules map tasks on nodes that already hold the relevant blocks, which is critical for MapReduce efficiency.
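The data-locality rule in the last bullet reduces to a simple preference order when scheduling a task. A minimal sketch (function and return labels are illustrative; real schedulers also consider rack-local placement as a middle tier):

```python
def schedule_map_task(block_locations, free_nodes):
    """Prefer a free node that already stores a replica of the input block
    (node-local read); otherwise fall back to any free node (remote read).

    block_locations: set of node names holding replicas of the block.
    free_nodes: list of nodes with spare task slots, in priority order.
    """
    local = [n for n in free_nodes if n in block_locations]
    if local:
        return local[0], "node-local"
    return free_nodes[0], "remote"
```

If `n2` holds a replica and is free, the task runs there and reads from local disk; only when no replica holder is free does the framework pay the network cost of a remote read.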

Compare: GFS vs. HDFS: both use master-based architectures with chunk/block replication, but HDFS is an open-source system modeled on GFS's published design and built for the Hadoop ecosystem. GFS uses 64 MB chunks; HDFS defaults to 128 MB blocks (raised from 64 MB in later versions) to reduce NameNode memory pressure. If a question asks about big data processing architectures, HDFS is your go-to example since it's more widely documented.


Traditional Network File Sharing

These systems prioritize transparent remote file access over massive scale, using client-server models that make remote files appear local to applications.

Network File System (NFS)

  • Stateless protocol design (in NFSv3) means servers don't track which clients have which files open. This simplifies crash recovery because the server can restart without reconstructing session state, but clients must retry failed operations themselves.
  • POSIX-compliant interface allows applications to use standard file operations (open, read, write, close) without modification.
  • Best for LAN environments where low latency is achievable. Performance degrades significantly over WANs because NFS clients frequently poll the server to check whether cached data is still fresh.

Note: NFSv4 moved to a stateful design with features like delegations and compound RPCs, addressing some of the scalability limitations of v3. Be aware of which version a question references.
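The polling behavior that hurts NFSv3 over WANs can be sketched as a client-side attribute cache: cached data is trusted for a short TTL, then revalidated with a GETATTR-style check. Class name, TTL value, and the callback signatures are all hypothetical simplifications.

```python
import time


class NFSAttrCache:
    """Sketch of NFSv3-style attribute caching: trust cached data within a
    short TTL, then re-poll the server's modification time to revalidate."""

    def __init__(self, ttl=3.0):
        self.ttl = ttl
        self.cache = {}  # filename -> (mtime_seen, data, validated_at)

    def read(self, filename, server_getattr, server_read, now=None):
        now = time.monotonic() if now is None else now
        entry = self.cache.get(filename)
        if entry:
            mtime_seen, data, validated_at = entry
            if now - validated_at < self.ttl:
                return data  # trust the cache within the TTL, no network traffic
            if server_getattr(filename) == mtime_seen:
                self.cache[filename] = (mtime_seen, data, now)
                return data  # revalidated with one cheap attribute check
        mtime = server_getattr(filename)
        data = server_read(filename)  # full fetch on miss or staleness
        self.cache[filename] = (mtime, data, now)
        return data
```

Every revalidation is a server round-trip, which is negligible on a LAN but dominates latency over a WAN; that is exactly the cost AFS's callbacks avoid.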

Andrew File System (AFS)

  • Whole-file caching on the client's local disk reduces network traffic dramatically. When a client opens a file, the entire file is transferred and cached locally. All reads and writes happen against this local copy.
  • Callback mechanism keeps caches consistent. The server issues a callback promise to each client that caches a file. When any client writes back a modified copy, the server breaks callbacks on all other clients, notifying them their cached copy is stale.
  • Location transparency through a global namespace (/afs/...) means users access files by logical path regardless of which physical server stores them.
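The callback mechanism can be sketched as a small amount of server-side state: a set of clients holding each file, invalidated when anyone writes back. Class and method names are invented for illustration; a real AFS server sends the callback breaks over RPC.

```python
class AFSServer:
    """Toy callback tracking: the server remembers which clients cache each
    file and 'breaks' those callbacks when a modified copy is stored."""

    def __init__(self):
        self.callbacks = {}  # filename -> set of clients with a valid cached copy

    def fetch(self, filename, client):
        # Whole-file fetch: register a callback promise for this client.
        self.callbacks.setdefault(filename, set()).add(client)

    def store(self, filename, writer):
        # Write-back on file close: break callbacks on all *other* clients.
        stale = self.callbacks.get(filename, set()) - {writer}
        self.callbacks[filename] = {writer}  # only the writer's cache stays fresh
        return stale  # these clients must re-fetch on their next open
```

Note the inversion relative to NFS: clients never poll; they assume their cache is valid until the server tells them otherwise, which is why AFS generates far fewer WAN round-trips.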

Compare: NFS vs. AFS: NFS caches at the block level and checks freshness frequently (attribute caching with short timeouts), while AFS caches entire files and uses server-initiated callbacks. AFS scales better across WANs due to reduced network round-trips, making it the better choice for geographically distributed organizations. The trade-off is that AFS has higher latency on first access (whole-file transfer) and uses session semantics where changes only become visible to other clients when the file is closed.


Decentralized and Self-Healing Systems

These modern architectures eliminate single points of failure by distributing metadata across the cluster. They use algorithms like consistent hashing and consensus protocols to maintain coherence without a central coordinator.

Ceph File System

  • CRUSH algorithm (Controlled Replication Under Scalable Hashing) computes data placement deterministically from the object name and a cluster map. Any node can calculate where data lives without consulting a central lookup table, which removes a major scalability bottleneck.
  • Unified storage model provides object (RADOS), block (RBD), and file (CephFS) interfaces from the same underlying cluster, so you don't need separate systems for different storage needs.
  • Self-healing replication automatically detects node failures through peer monitoring and re-replicates affected data to restore the target redundancy level without administrator intervention.
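CRUSH's key property, that any node can compute placement from the object name and cluster map alone, can be demonstrated with a rendezvous-hashing sketch. Real CRUSH walks a weighted failure-domain hierarchy; this flat stand-in only shows the deterministic, lookup-free placement idea.

```python
import hashlib


def place(object_name, osds, replicas=3):
    """Deterministically choose `replicas` storage daemons for an object.

    Every participant hashes (object, osd) pairs and picks the top scorers,
    so identical inputs yield identical placements everywhere, with no
    central lookup table to consult.
    """
    def score(osd):
        digest = hashlib.sha256(f"{object_name}:{osd}".encode()).hexdigest()
        return int(digest, 16)

    return sorted(osds, key=score, reverse=True)[:replicas]
```

Because placement is a pure function of the name and the member list, a client and a recovering storage daemon independently agree on where every object lives.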

GlusterFS

  • Stackable translator architecture allows flexible composition of features like replication, striping, distribution, and caching as modular layers. You configure a volume by stacking the translators you need.
  • No dedicated metadata server: metadata is stored alongside data using an elastic hashing algorithm to map filenames to storage locations. This means the filesystem scales linearly as you add more storage nodes, with no metadata bottleneck.
  • Elastic volume management enables adding or removing storage bricks (individual storage exports from a node) while the system remains online, rebalancing data in the background.
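The "no metadata server" claim rests on the same idea in a different form: each brick owns a slice of a hash space, and the brick for a file is computed from its name. A rough sketch, using MD5 as a stand-in for GlusterFS's actual hash function and a flat global hash space instead of per-directory layouts:

```python
import hashlib


def pick_brick(filename, bricks):
    """Map a filename into the 32-bit hash space and return the brick whose
    slice contains it; no metadata server is consulted."""
    h = int(hashlib.md5(filename.encode()).hexdigest(), 16) % (1 << 32)
    slice_size = (1 << 32) // len(bricks)
    index = min(h // slice_size, len(bricks) - 1)  # clamp into the last slice
    return bricks[index]
```

Adding a brick changes the slice boundaries, which is why online rebalancing (the third bullet) must migrate files whose hashes now fall in a different slice.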

Compare: Ceph vs. GlusterFS: both are decentralized and run on commodity hardware, but Ceph uses the more sophisticated CRUSH algorithm for placement while GlusterFS relies on a distributed hash table approach. Ceph offers more storage interfaces (object, block, and file); GlusterFS is simpler to deploy for pure file storage needs. Ceph also separates its monitor daemons (which maintain cluster state via Paxos consensus) from its storage daemons, giving it stronger consistency guarantees at the cost of more operational complexity.


High-Performance Computing (HPC) Systems

Designed for scientific computing and analytics workloads, these systems maximize parallel I/O throughput. They use aggressive striping and parallel metadata operations to saturate high-speed interconnects.

Lustre File System

  • Separate metadata and object storage servers (MDS and OSS) allow both to scale independently. You can add more OSSes for bandwidth or more MDSes for metadata-heavy workloads (recent Lustre versions support distributed metadata via DNE).
  • Data striping across OSTs (Object Storage Targets) splits single files across multiple storage targets. A single large file can achieve the aggregate read/write bandwidth of all the OSTs it's striped across, which is how Lustre delivers the I/O rates HPC applications demand.
  • Dominant in supercomputing: Lustre powers many of the world's fastest systems, handling petabytes of data across thousands of nodes.
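Round-robin striping is just modular arithmetic on the file offset, which is worth being able to do by hand on an exam. A sketch (parameter names are illustrative; in Lustre these come from the file's layout attributes):

```python
def stripe_location(offset, stripe_size, stripe_count):
    """Map a file byte offset to (OST index, byte offset within that OST's
    object) under round-robin striping.

    Stripe k of the file lands on OST k % stripe_count; each OST's object
    holds every stripe_count-th stripe back to back.
    """
    stripe_index = offset // stripe_size            # which stripe of the file
    ost = stripe_index % stripe_count               # round-robin over the OSTs
    stripes_on_ost = stripe_index // stripe_count   # earlier stripes on that OST
    return ost, stripes_on_ost * stripe_size + offset % stripe_size
```

With a 1 MiB stripe over 4 OSTs, a sequential read of a large file pulls from all four targets concurrently, which is where the aggregate-bandwidth claim above comes from.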

IBM General Parallel File System (GPFS / Spectrum Scale)

  • Distributed metadata across multiple nodes eliminates the single-MDS bottleneck. Any node in the cluster can serve metadata requests, which improves both throughput and resilience.
  • Token-based distributed locking provides byte-range locking with strong consistency. Multiple nodes can write to different regions of the same file simultaneously without conflicts, and the token manager coordinates access only when ranges overlap.
  • Tiered storage support automatically migrates data between fast storage (SSD) and slower storage (HDD/tape) based on access patterns and administrator-defined policies.
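The byte-range locking rule reduces to an interval-overlap test: the token manager only needs to coordinate writers whose ranges intersect. A minimal sketch using half-open `[start, end)` ranges (the interval convention here is an assumption, not GPFS's wire format):

```python
def ranges_conflict(a, b):
    """Return True if two byte-range write tokens overlap and therefore
    require coordination; disjoint ranges proceed fully in parallel."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end
```

Two nodes writing `[0, 100)` and `[100, 200)` of the same file never contact the token manager after their initial grants, which is how GPFS sustains concurrent writers on one file.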

Compare: Lustre vs. GPFS: both target HPC workloads, but Lustre historically used centralized metadata (simpler but less scalable for metadata-heavy workloads) while GPFS distributes metadata from the start. GPFS offers more enterprise features like snapshots, encryption, and policy-driven tiering. Lustre tends to dominate in raw sequential I/O benchmarks and is open-source, making it the default choice for many national labs and supercomputing centers.


Cloud-Native Managed Services

These fully managed offerings abstract away infrastructure complexity, automatically handling scaling, replication, and failover. They trade some control and customization for operational simplicity and tight integration with cloud ecosystems.

Amazon Elastic File System (EFS)

  • Automatic elastic scaling grows and shrinks storage capacity as files are added or removed, with no provisioning required. You pay only for what you store.
  • NFSv4.1 protocol ensures compatibility with existing Linux applications without code changes.
  • Performance modes let you choose between General Purpose (optimized for latency-sensitive operations like web serving) and Max I/O (optimized for aggregate throughput with thousands of concurrent clients). There are also throughput modes (Bursting vs. Provisioned) that control how much bandwidth you get.

Microsoft Azure Files

  • SMB protocol support (and NFSv4.1 for Linux workloads) enables Windows applications and lift-and-shift migrations from on-premises file servers.
  • Azure AD integration provides identity-based access control using existing enterprise credentials, which simplifies permission management for organizations already using Azure Active Directory.
  • Hybrid scenarios supported through Azure File Sync, which caches frequently accessed cloud files on on-premises servers while tiering cold data to the cloud.

Compare: EFS vs. Azure Files: EFS uses NFS (Linux-native) while Azure Files primarily uses SMB (Windows-native), though Azure Files now also supports NFS. Both auto-scale and integrate with their respective cloud ecosystems. Choose based on your application's protocol requirements, your existing cloud infrastructure, and whether you need hybrid on-premises caching (Azure File Sync has no direct EFS equivalent).


Quick Reference Table

Concept                                     | Best Examples
--------------------------------------------|-------------------
Master-based metadata management            | GFS, HDFS, Lustre
Decentralized / no single point of failure  | Ceph, GlusterFS
Whole-file caching for WAN efficiency       | AFS
Block-level caching for LAN access          | NFS
HPC / parallel I/O optimization             | Lustre, GPFS
Cloud-managed auto-scaling                  | EFS, Azure Files
Unified object/block/file storage           | Ceph
Write-once-read-many consistency            | HDFS

Self-Check Questions

  1. Which two distributed file systems eliminate single points of failure through decentralized architectures, and what algorithms do they use for data placement?

  2. Compare the caching strategies of NFS and AFS. Which is better suited for a geographically distributed organization, and why?

  3. Both GFS and HDFS use master-based architectures. What consistency model does HDFS implement, and how does this simplify its design?

  4. If you needed to support both Linux (NFS) and Windows (SMB) clients in a cloud environment, which managed services would you consider, and what trade-offs would you evaluate?

  5. You're asked to design a storage system for a scientific computing cluster processing petabyte-scale simulation data. Which architectural features from Lustre or GPFS would you incorporate, and how do they achieve high parallel throughput?