When you're working with big data analytics and visualization, the framework you choose determines everything: how fast you can process data, whether you can handle real-time streams, and what kind of insights you can extract. You're being tested on understanding why different frameworks exist, when to use each one, and how they solve fundamentally different problems. The key concepts here include batch vs. stream processing, storage architecture, fault tolerance, and query optimization.
Don't fall into the trap of memorizing framework names and features in isolation. Instead, focus on the underlying principles: What problem does each framework solve? How does its architecture enable that solution? When would you choose one over another? These comparative questions are exactly what you'll face on exams and in real-world data engineering decisions.
These frameworks established the core paradigms for processing massive datasets by distributing work across clusters. The key insight: break big problems into smaller pieces, process them in parallel, then combine results.
Compare: Hadoop vs. Hive. Both process batch data on HDFS, but Hadoop requires MapReduce programming while Hive provides SQL abstraction. If asked about enabling business analysts to query big data, Hive is your answer.
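To make the split/parallel/combine idea concrete, here's a minimal pure-Python sketch of a MapReduce-style word count, the same computation a classic Hadoop job performs. This is not Hadoop's API; the chunking and helper names are illustrative.

```python
# Toy MapReduce: split the input, map chunks in parallel, reduce the results.
from collections import Counter
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: emit partial word counts for one chunk of input."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts

def word_count(lines, workers=4):
    # Split: divide the input into chunks, one per worker.
    chunks = [lines[i::workers] for i in range(workers)]
    # Map: process chunks in parallel.
    with Pool(workers) as pool:
        partials = pool.map(map_chunk, chunks)
    # Reduce: merge the partial counts into a final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

if __name__ == "__main__":
    docs = ["big data is big", "data moves fast", "big insights"]
    print(word_count(docs).most_common(2))  # [('big', 3), ('data', 2)]
```

Hive's appeal is that it compiles a plain SQL `GROUP BY` into the equivalent distributed job automatically, which is why it's the answer for analysts who don't write code.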
These frameworks dramatically improved processing speed by keeping data in memory rather than writing to disk between operations. The mechanism: RAM access is orders of magnitude faster than disk I/O.
Compare: Spark vs. Flink. Both handle batch and streaming, but Spark's streaming is micro-batch (small batches processed quickly) while Flink is native streaming (event-by-event). For sub-second latency requirements, Flink is the stronger choice.
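A minimal PySpark sketch of the in-memory idea (assumes `pyspark` is installed; the file name and `status` column are hypothetical): `cache()` keeps the dataset in RAM after the first action, so later operations skip the disk entirely.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input file and schema.
df = spark.read.csv("events.csv", header=True, inferSchema=True)
df.cache()  # mark the DataFrame for in-memory storage

total = df.count()  # first action: reads from disk, populates the cache
errors = df.filter(df["status"] == "error").count()  # served from RAM

print(total, errors)
spark.stop()
```

This is the same reason iterative workloads (machine learning, graph algorithms) run far faster on Spark than on disk-bound MapReduce: each pass reuses memory-resident data instead of re-reading it.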
When milliseconds matter, these frameworks provide continuous processing of data as it arrives. The principle: process data in motion rather than at rest.
Compare: Storm vs. Kafka. Storm processes streams while Kafka transports and stores them. They're often used together: Kafka as the message backbone, Storm (or Flink/Spark) as the processing engine.
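A hedged sketch of that division of labor using the third-party `kafka-python` package (the broker address and `transactions` topic are illustrative): Kafka moves and retains the events, while the loop at the bottom stands in for a real processing engine like Storm or Flink.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: publish events onto the Kafka backbone.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("transactions", {"id": 1, "amount": 42.0})
producer.flush()

# Consumer side: where a stream processor (Storm/Flink/Spark) would attach.
consumer = KafkaConsumer(
    "transactions",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)
for message in consumer:
    print(message.value)  # a real engine would transform/aggregate here
    break
```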
These NoSQL databases solve the problem of storing and retrieving massive datasets with low latency. The trade-off: they sacrifice some relational database features (joins, ACID transactions) for horizontal scalability.
Compare: HBase vs. Cassandra. Both are column-oriented and highly scalable, but HBase requires HDFS and provides strong consistency, while Cassandra is standalone with tunable consistency. Choose HBase for Hadoop ecosystems, Cassandra for always-available global applications.
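Cassandra's tunable consistency is easiest to see in code. Here's a sketch with the DataStax `cassandra-driver` package (the node address, `analytics` keyspace, and `events` table are hypothetical): each statement picks its own point on the consistency/availability spectrum.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])        # assumed local node
session = cluster.connect("analytics")  # hypothetical keyspace

# ONE: answer from a single replica -- fastest, weakest guarantee.
fast_read = SimpleStatement(
    "SELECT * FROM events WHERE id = 1",
    consistency_level=ConsistencyLevel.ONE,
)

# QUORUM: a majority of replicas must agree -- slower, stronger guarantee.
safe_read = SimpleStatement(
    "SELECT * FROM events WHERE id = 1",
    consistency_level=ConsistencyLevel.QUORUM,
)

print(session.execute(fast_read).one())
print(session.execute(safe_read).one())
cluster.shutdown()
```

HBase doesn't offer this dial: each row is served by a single region server, which is exactly what gives it strong consistency.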
When you need to search through massive text datasets or perform real-time aggregations, specialized engines outperform general-purpose databases. The mechanism: inverted indexes map terms to documents for sub-second full-text search.
Compare: MongoDB vs. Elasticsearch. Both store JSON documents, but MongoDB is optimized for CRUD operations and application data, while Elasticsearch excels at search and analytics. Many architectures use MongoDB as the primary store and sync to Elasticsearch for search.
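The inverted-index mechanism itself fits in a few lines of plain Python. This toy version shows why term lookups stay sub-second at scale in engines like Elasticsearch: a query becomes a set intersection instead of a scan over every document.

```python
from collections import defaultdict

docs = {
    1: "spark keeps data in memory",
    2: "flink streams data event by event",
    3: "kafka stores streams of data",
}

# Build the inverted index: each term maps to the set of docs containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

def search(*terms):
    """Return ids of docs containing all terms, via set intersection."""
    postings = [index[term] for term in terms]
    return set.intersection(*postings) if postings else set()

print(search("data", "streams"))  # {2, 3}
```

Real engines add tokenization, relevance scoring, and distributed shards on top, but the core data structure is this term-to-documents map.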
| Concept | Best Examples |
|---|---|
| Batch Processing | Hadoop, Hive, Spark |
| Stream Processing | Flink, Storm, Kafka Streams |
| In-Memory Speed | Spark, Flink |
| Message Transport | Kafka |
| SQL on Big Data | Hive, Spark SQL |
| Column-Family Storage | HBase, Cassandra |
| Document Storage | MongoDB, Elasticsearch |
| Full-Text Search | Elasticsearch |
Which two frameworks both support batch and stream processing, and what architectural difference determines when you'd choose one over the other?
If you need to enable business analysts to query petabytes of historical data using SQL without writing code, which framework would you recommend and why?
Compare HBase and Cassandra: what underlying infrastructure requirement differs between them, and how does this affect their consistency guarantees?
A financial services company needs exactly-once processing guarantees for transaction streams with sub-second latency. Which framework best fits this requirement, and what feature enables this guarantee?
Explain why Kafka is often called the "central nervous system" of big data architectures: what role does it play that differs from processing frameworks like Spark or Storm?