🤝Collaborative Data Science

Key Concepts in Big Data Processing Frameworks

Why This Matters

When you're working with datasets that don't fit on a single machine—or when your analysis needs to run faster than sequential processing allows—you need distributed computing frameworks. These tools aren't just about handling "big" data; they're about understanding the fundamental trade-offs between latency, throughput, fault tolerance, and ease of use that shape every data science pipeline you'll build in collaborative environments.

You're being tested on your ability to choose the right tool for a given problem and explain why it fits. Can you articulate when batch processing beats streaming? Do you understand why some frameworks require a cluster while others scale from your laptop? Don't just memorize framework names—know what processing paradigm each represents and when you'd reach for it in a real project.


Batch Processing Foundations

These frameworks process data in large chunks, optimizing for throughput over latency. Batch processing assumes your data is bounded—you know when the input ends—and prioritizes complete, correct results over speed.

Apache Hadoop

  • HDFS (Hadoop Distributed File System)—provides fault-tolerant storage by replicating data blocks across multiple nodes in a cluster
  • MapReduce paradigm enables parallel computation by breaking jobs into independent map tasks followed by aggregating reduce tasks (see the sketch after this list)
  • Scalability through commodity hardware—designed to run on inexpensive machines, making it foundational for cost-effective big data infrastructure
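
To make the map and reduce phases concrete, here is a minimal, framework-free Python sketch of a word count. It illustrates only the paradigm (map, shuffle, reduce); real Hadoop jobs are typically written in Java against the MapReduce API and read their input from HDFS blocks.

```python
from collections import defaultdict

# Toy illustration of the MapReduce paradigm (word count), not Hadoop's Java API:
# map each record to (key, value) pairs, shuffle by key, then reduce per key.

def map_phase(lines):
    """Map: emit (word, 1) for every word in every input line."""
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    """Shuffle: group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate each key's values; here, sum the counts."""
    return {word: sum(counts) for word, counts in groups.items()}

lines = ["big data needs big clusters", "data pipelines move data"]
print(reduce_phase(shuffle(map_phase(lines))))
# {'big': 2, 'data': 3, 'needs': 1, 'clusters': 1, 'pipelines': 1, 'move': 1}
```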

Apache Spark

  • In-memory processing—keeps intermediate results in RAM rather than writing to disk, achieving up to 100x speedup over MapReduce for iterative algorithms
  • Unified API supports batch, streaming, SQL queries, and machine learning through libraries like MLlib and Spark SQL
  • Lazy evaluation with DAG (Directed Acyclic Graph) optimization—Spark builds an execution plan before running, enabling automatic performance tuning (see the sketch after this list)
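
As a concrete look at lazy evaluation and caching, here is a minimal PySpark sketch (assuming a local pyspark installation; the app name and data are arbitrary). Transformations only extend the DAG, nothing runs until an action is called, and cache() keeps the result in memory for reuse.

```python
from pyspark.sql import SparkSession

# Start a local Spark session (on a shared cluster you would point master elsewhere).
spark = SparkSession.builder.appName("lazy-eval-demo").master("local[*]").getOrCreate()

numbers = spark.sparkContext.parallelize(range(1_000_000))

# No work happens yet: map() and filter() are lazy and only extend the DAG.
even_squares = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

# cache() marks the dataset for in-memory reuse across actions, which is what
# makes iterative algorithms fast relative to MapReduce's per-step disk writes.
even_squares.cache()

print(even_squares.count())   # first action triggers execution and fills the cache
print(even_squares.take(5))   # second action reuses the cached partitions

spark.stop()
```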

Compare: Hadoop vs. Spark—both handle distributed batch processing, but Hadoop writes intermediate results to disk while Spark keeps them in memory. If an assignment asks about iterative machine learning algorithms, Spark is your go-to example because repeated disk I/O kills performance.


Stream Processing Engines

Stream processors handle unbounded data—continuous flows where you never know when (or if) the input ends. The key challenge is maintaining state and ensuring exactly-once processing semantics while data keeps arriving.

Apache Flink

  • True streaming architecture—processes events one at a time as they arrive, rather than collecting them into micro-batches
  • Event time processing handles out-of-order data using watermarks (sketched conceptually after this list), critical for applications where network delays scramble message order
  • Stateful computations with checkpointing enable exactly-once semantics even when failures occur mid-stream
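
The watermark idea is easier to see in a toy example. The following is plain Python rather than Flink's API: a watermark that trails the largest event time seen so far decides when each event-time window is final, so events that arrive out of order can still land in the right window.

```python
from collections import defaultdict

# Conceptual sketch of event-time windows plus a watermark; not Flink's API.
ALLOWED_LATENESS = 5          # seconds the watermark trails the max event time
WINDOW_SIZE = 10              # tumbling windows covering 10 seconds of event time

windows = defaultdict(list)   # window start time -> values seen so far
max_event_time = 0
emitted = set()

def on_event(event_time, value):
    """Assign the event to its event-time window, then advance the watermark."""
    global max_event_time
    windows[(event_time // WINDOW_SIZE) * WINDOW_SIZE].append(value)
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS
    for start in list(windows):            # finalize every window the watermark passed
        if start + WINDOW_SIZE <= watermark and start not in emitted:
            emitted.add(start)
            print(f"window [{start}, {start + WINDOW_SIZE}) -> {sum(windows[start])}")

# The event stamped 7 arrives after the one stamped 12, yet it still counts
# toward the [0, 10) window because the watermark has not passed 10 yet.
for t, v in [(3, 1), (12, 1), (7, 1), (15, 1), (28, 1)]:
    on_event(t, v)
```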

Apache Storm

  • Topology-based architecture—data flows through a directed graph of spouts (data sources) and bolts (processing nodes), as sketched after this list
  • Sub-second latency makes it ideal for real-time monitoring, fraud detection, and alerting systems
  • At-least-once processing by default—simpler guarantees than Flink, but requires your application to handle potential duplicates
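
As a rough mental model of spouts and bolts (plain Python generators here, not Storm's Java API, and the sensor events are made up), the chain below wires one data source into two processing steps:

```python
import random

# Toy spout-and-bolt chain; a real Storm topology is declared against Storm's
# Java API and distributed across worker processes.

def sensor_spout(n=20):
    """Spout: the data source; emits one tuple at a time."""
    for i in range(n):
        yield {"sensor": f"s{i % 2}", "reading": random.uniform(0, 100)}

def threshold_bolt(stream, limit=80.0):
    """Bolt: pass along only readings above an alert threshold."""
    for event in stream:
        if event["reading"] > limit:
            yield event

def alert_bolt(stream):
    """Terminal bolt: in a real topology this might page an operator."""
    for event in stream:
        print(f"ALERT {event['sensor']}: {event['reading']:.1f}")

# Wire the topology: spout -> threshold bolt -> alert bolt.
alert_bolt(threshold_bolt(sensor_spout()))
```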

Apache Samza

  • Tight Kafka integration—designed specifically to consume from and produce to Kafka topics, simplifying stream processing pipelines
  • Local state stores allow stateful processing without external databases, using RocksDB for persistent key-value storage (see the sketch after this list)
  • Job isolation—each Samza job runs independently, making it easier to reason about failures and scaling
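
A tiny sketch of the local-state idea (plain Python, not Samza's Java API; a dict stands in for the RocksDB store that would sit next to the task):

```python
# Each task keeps its own key-value store co-located with the processing logic,
# so per-key aggregation needs no round-trip to an external database.
local_store = {}  # stand-in for a RocksDB-backed store

def process(message):
    """Handle one message from the input topic and update local state."""
    user = message["user"]
    local_store[user] = local_store.get(user, 0) + 1
    return {"user": user, "clicks_so_far": local_store[user]}

for msg in [{"user": "ana"}, {"user": "bo"}, {"user": "ana"}]:
    print(process(msg))
```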

Compare: Flink vs. Storm—both process streams in real-time, but Flink offers stronger guarantees (exactly-once semantics) and native event-time handling. Storm is simpler and lower-latency but pushes more complexity to the application developer. For reproducible data science, Flink's guarantees often matter more.


Messaging and Data Integration

These systems don't process data themselves—they move it reliably between producers and consumers. Think of them as the nervous system connecting your distributed applications.

Apache Kafka

  • Distributed commit log—messages are persisted to disk and replicated across brokers, enabling fault-tolerant, high-throughput streaming
  • Consumer groups allow multiple applications to read the same data stream independently, each maintaining its own offset position (see the sketch after this list)
  • Publish-subscribe and queue patterns—flexible enough to serve as both a message queue and a broadcast system depending on configuration
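
A minimal sketch with the kafka-python client shows both sides of the log; the broker address, topic name, and group id below are placeholders, and the topic is assumed to exist already.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer: append JSON events to the "clicks" topic (the distributed commit log).
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("clicks", {"user": "ana", "page": "/home"})
producer.flush()

# Consumer: join a consumer group; the group tracks its own offsets, and
# auto_offset_reset="earliest" lets a brand-new group replay retained history.
consumer = KafkaConsumer(
    "clicks",
    bootstrap_servers="localhost:9092",
    group_id="analytics-team",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for record in consumer:
    print(record.offset, record.value)
    break  # stop after one message for this demo
```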

Compare: Kafka vs. traditional message queues—Kafka retains messages after consumption (configurable retention period), allowing replay and multiple consumers. This is essential for reproducible pipelines where you might need to reprocess historical data after fixing a bug.


Unified and Portable Frameworks

These tools abstract away the underlying execution engine, letting you write code once and run it anywhere. Portability matters for collaborative work—your pipeline shouldn't break when a teammate uses a different cluster.

Apache Beam

  • Write-once, run-anywhere model—define pipelines using Beam's SDK, then execute on Spark, Flink, Dataflow, or other runners (see the sketch after this list)
  • Unified batch and streaming through the same API—your code handles both bounded and unbounded data without rewrites
  • Windowing and triggers provide fine-grained control over how streaming data gets grouped and when results are emitted
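
Here is a minimal word-count pipeline in the Beam Python SDK (assuming apache-beam is installed). With no options it runs on the local DirectRunner; pointing it at a Spark, Flink, or Dataflow runner does not change the pipeline code.

```python
import apache_beam as beam

# A tiny Beam pipeline: the same transforms work for bounded and unbounded input.
with beam.Pipeline() as p:
    (
        p
        | "Read" >> beam.Create(["spark flink beam", "beam dataflow beam"])
        | "Split" >> beam.FlatMap(lambda line: line.split())
        | "Pair" >> beam.Map(lambda word: (word, 1))
        | "Count" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print)
    )
```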

Google Cloud Dataflow

  • Fully managed execution—no cluster provisioning or capacity planning; resources scale automatically based on workload
  • Built on Apache Beam—same programming model, but with automatic optimization and integration with BigQuery, Pub/Sub, and other GCP services (see the runner sketch after this list)
  • Reproducibility features include job templates and version-controlled pipeline definitions for collaborative workflows
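
Switching a Beam pipeline from local execution to Dataflow is mostly a matter of pipeline options. A sketch, with placeholder project, region, and bucket names (and assuming GCP credentials are already configured):

```python
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

# The pipeline code stays the same; only the runner and its resources change.
options = PipelineOptions(
    runner="DataflowRunner",           # use "DirectRunner" to run locally instead
    project="my-gcp-project",          # placeholder GCP project id
    region="us-central1",              # placeholder region
    temp_location="gs://my-bucket/tmp" # placeholder staging bucket
)

with beam.Pipeline(options=options) as p:
    p | beam.Create([1, 2, 3]) | beam.Map(lambda x: x * x) | beam.Map(print)
```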

Compare: Beam vs. Dataflow—Beam is the open-source programming model; Dataflow is Google's managed service that runs Beam pipelines. In collaborative projects, Beam gives you portability while Dataflow reduces operational overhead.


Interactive and Analytical Querying

Sometimes you need answers fast—these tools optimize for low-latency queries over large datasets, enabling exploratory analysis and business intelligence.

Presto

  • Federated queries—join data across Hadoop, S3, PostgreSQL, and other sources in a single SQL statement without moving data (see the sketch after this list)
  • In-memory execution with pipelining—intermediate results flow between operators without materialization to disk
  • ANSI SQL support makes it accessible to analysts who know SQL but not distributed systems programming
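
A sketch of a federated query issued from Python, assuming the presto-python-client package (imported as prestodb) and a coordinator on localhost:8080; the catalog, schema, and table names are placeholders.

```python
import prestodb

# Connect to the Presto coordinator; catalogs expose the underlying data sources.
conn = prestodb.dbapi.connect(
    host="localhost", port=8080, user="analyst",
    catalog="hive", schema="default",
)
cur = conn.cursor()

# One SQL statement joins a Hive table (on S3/HDFS) with a PostgreSQL table,
# without copying data between systems first.
cur.execute("""
    SELECT u.country, count(*) AS events
    FROM hive.web.page_views v
    JOIN postgresql.public.users u ON v.user_id = u.id
    GROUP BY u.country
    ORDER BY events DESC
    LIMIT 10
""")
for row in cur.fetchall():
    print(row)
```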

Dask

  • Python-native parallelism—extends familiar NumPy and Pandas interfaces to datasets larger than memory
  • Dynamic task scheduling builds computation graphs lazily and executes them efficiently across available cores or cluster nodes
  • Seamless scaling—develop on your laptop with small data, then deploy the same code to a cluster for production workloads (see the sketch after this list)
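
A minimal Dask sketch (assuming dask[dataframe] is installed; the CSV glob and column names are placeholders). The same code runs on laptop threads, and attaching a dask.distributed Client pointed at a scheduler moves it to a cluster without changes.

```python
import dask.dataframe as dd

# Lazily build a task graph over many partitions of a larger-than-memory dataset.
df = dd.read_csv("events-*.csv")              # matches many CSV files at once
per_user = df.groupby("user_id")["amount"].sum()

# Nothing has been read yet; .compute() executes the graph in parallel
# and returns an ordinary pandas object with the result.
print(per_user.compute().head())
```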

Compare: Presto vs. Dask—Presto excels at SQL queries across diverse data sources; Dask is better when you need programmatic data manipulation in Python. For reproducible notebooks, Dask integrates more naturally with Jupyter workflows.


Quick Reference Table

Concept                            Best Examples
Batch processing (disk-based)      Hadoop MapReduce
Batch processing (in-memory)       Spark
True stream processing             Flink, Storm, Samza
Message streaming / integration    Kafka
Portable pipeline definitions      Beam, Dataflow
Interactive SQL queries            Presto
Python-native parallelism          Dask
Managed cloud services             Google Cloud Dataflow

Self-Check Questions

  1. Which two frameworks both support stream processing but differ in their delivery guarantees—and when would the simpler guarantee be acceptable?

  2. If you needed to reprocess six months of historical event data after discovering a bug in your pipeline, which messaging system's architecture makes this possible, and why?

  3. Compare Spark and Hadoop: what specific architectural difference explains Spark's performance advantage for iterative machine learning algorithms?

  4. You're building a data pipeline that needs to run on your team's local Spark cluster today but might move to Google Cloud next quarter. Which framework would you choose to write the pipeline, and what's the key benefit?

  5. A collaborator asks whether to use Presto or Dask for exploring a 500GB dataset. What questions would you ask to help them decide, and what does each tool optimize for?