When you're working with datasets that don't fit on a single machine—or when your analysis needs to run faster than sequential processing allows—you need distributed computing frameworks. These tools aren't just about handling "big" data; they're about understanding the fundamental trade-offs between latency, throughput, fault tolerance, and ease of use that shape every data science pipeline you'll build in collaborative environments.
You're being tested on your ability to choose the right tool for a given problem and explain why it fits. Can you articulate when batch processing beats streaming? Do you understand why some frameworks require a cluster while others scale from your laptop? Don't just memorize framework names—know what processing paradigm each represents and when you'd reach for it in a real project.
Batch processing frameworks handle data in large chunks, optimizing for throughput over latency. They assume your data is bounded (you know when the input ends) and prioritize complete, correct results over speed.
Compare: Hadoop vs. Spark—both handle distributed batch processing, but Hadoop writes intermediate results to disk while Spark keeps them in memory. If an assignment asks about iterative machine learning algorithms, Spark is your go-to example because Hadoop's repeated disk I/O kills performance on workloads that revisit the same data many times.
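To make the difference concrete, here's a minimal PySpark sketch (assuming a local SparkSession; the Parquet path and column names are placeholders). Caching the working set keeps it in memory across iterations, which is exactly the repeated work a disk-based MapReduce job would redo on every pass.

```python
# Minimal PySpark sketch: cache() keeps the working set in memory so an
# iterative loop doesn't re-read the source data on every pass.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()

# Placeholder dataset and columns -- swap in your own.
features = spark.read.parquet("events.parquet").select("feature", "label").cache()

for step in range(10):  # stand-in for an iterative ML loop (e.g., gradient steps)
    summary = features.groupBy("label").count().collect()

spark.stop()
```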
Stream processors handle unbounded data—continuous flows where you never know when (or if) the input ends. The key challenge is maintaining state and ensuring exactly-once processing semantics while data keeps arriving.
Compare: Flink vs. Storm—both process streams in real-time, but Flink offers stronger guarantees (exactly-once semantics) and native event-time handling. Storm is simpler and lower-latency but pushes more complexity to the application developer. For reproducible data science, Flink's guarantees often matter more.
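As a rough PyFlink sketch of the idea (assuming the apache-flink package is installed; the bounded collection is just a stand-in for a real stream), enabling checkpointing is what backs Flink's exactly-once state guarantees.

```python
# Rough PyFlink sketch: checkpointing makes the keyed running state
# recoverable, which underpins Flink's exactly-once guarantees.
from pyflink.datastream import StreamExecutionEnvironment

env = StreamExecutionEnvironment.get_execution_environment()
env.enable_checkpointing(5000)  # checkpoint every 5 seconds

# Stand-in events: (user_id, amount); a real job would read from Kafka, etc.
events = env.from_collection([("alice", 3), ("bob", 5), ("alice", 2)])

# Key by user and keep a running sum; this state survives failures thanks
# to the checkpoints configured above.
totals = events.key_by(lambda e: e[0]).reduce(lambda a, b: (a[0], a[1] + b[1]))

totals.print()
env.execute("streaming-totals")
```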
Messaging and streaming platforms don't process data themselves; they move it reliably between producers and consumers. Think of them as the nervous system connecting your distributed applications.
Compare: Kafka vs. traditional message queues—Kafka retains messages after consumption (configurable retention period), allowing replay and multiple consumers. This is essential for reproducible pipelines where you might need to reprocess historical data after fixing a bug.
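Here's a hedged sketch using the kafka-python client (the topic name, broker address, and group id are illustrative). Pointing a fresh consumer group at the earliest retained offset is what makes replay possible after a fix.

```python
# Sketch of replaying retained Kafka history with kafka-python: a new
# consumer group plus auto_offset_reset="earliest" starts from the oldest
# message still within the topic's retention period.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "user-events",                      # illustrative topic name
    bootstrap_servers="localhost:9092", # illustrative broker address
    group_id="reprocess-after-bugfix",  # fresh group => no committed offsets
    auto_offset_reset="earliest",
    enable_auto_commit=False,
)

for message in consumer:
    # Re-run the corrected processing logic over each historical record.
    print(message.offset, message.value)
```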
Portable pipeline frameworks abstract away the underlying execution engine, letting you write code once and run it anywhere. Portability matters for collaborative work: your pipeline shouldn't break when a teammate uses a different cluster.
Compare: Beam vs. Dataflow—Beam is the open-source programming model; Dataflow is Google's managed service that runs Beam pipelines. In collaborative projects, Beam gives you portability while Dataflow reduces operational overhead.
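A minimal Apache Beam (Python SDK) sketch; the in-memory source and output prefix are placeholders. The same pipeline code runs locally on the DirectRunner today and on Dataflow later by switching the runner option (plus the usual Google Cloud project and staging settings).

```python
# Minimal Beam sketch: the pipeline definition is runner-agnostic; only the
# options change when you move from local execution to Dataflow.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions(runner="DirectRunner")  # later: "DataflowRunner"

with beam.Pipeline(options=options) as p:
    (
        p
        | "Create" >> beam.Create(["3", "7", "11"])   # stand-in for a real source
        | "Parse" >> beam.Map(int)
        | "Double" >> beam.Map(lambda x: x * 2)
        | "Write" >> beam.io.WriteToText("doubled")   # placeholder output prefix
    )
```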
Sometimes you need answers fast. Interactive query and analysis engines optimize for low-latency queries over large datasets, enabling exploratory analysis and business intelligence.
Compare: Presto vs. Dask—Presto excels at SQL queries across diverse data sources; Dask is better when you need programmatic data manipulation in Python. For reproducible notebooks, Dask integrates more naturally with Jupyter workflows.
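For the Dask side, a small sketch (the file path and column names are placeholders): the pandas-style expression builds a lazy task graph that executes in parallel across partitions only when .compute() is called, which is why it drops cleanly into a Jupyter workflow.

```python
# Small Dask sketch: same pandas-like API, but partitioned and lazy until
# .compute() triggers parallel execution.
import dask.dataframe as dd

df = dd.read_parquet("events/*.parquet")  # placeholder path; builds a task graph

user_totals = df.groupby("user_id")["amount"].sum().compute()
print(user_totals.head())
```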
| Concept | Best Examples |
|---|---|
| Batch processing (disk-based) | Hadoop MapReduce |
| Batch processing (in-memory) | Spark |
| True stream processing | Flink, Storm, Samza |
| Message streaming / integration | Kafka |
| Portable pipeline definitions | Beam, Dataflow |
| Interactive SQL queries | Presto |
| Python-native parallelism | Dask |
| Managed cloud services | Google Cloud Dataflow |
1. Which two frameworks both support stream processing but differ in their delivery guarantees—and when would the simpler guarantee be acceptable?
2. If you needed to reprocess six months of historical event data after discovering a bug in your pipeline, which messaging system's architecture makes this possible, and why?
3. Compare Spark and Hadoop: what specific architectural difference explains Spark's performance advantage for iterative machine learning algorithms?
4. You're building a data pipeline that needs to run on your team's local Spark cluster today but might move to Google Cloud next quarter. Which framework would you choose to write the pipeline, and what's the key benefit?
5. A collaborator asks whether to use Presto or Dask for exploring a 500GB dataset. What questions would you ask to help them decide, and what does each tool optimize for?