Big data processing sits at the heart of modern computer science—and the AP exam expects you to understand not just what these tools do, but why different architectures exist for different problems. You're being tested on concepts like distributed computing, batch vs. stream processing, fault tolerance, and data storage paradigms. These tools represent real-world solutions to the fundamental challenge of processing data that's too large for a single machine.
Don't just memorize tool names and features. Instead, focus on understanding the underlying principles: Why would you choose in-memory processing over disk-based? When does stream processing beat batch processing? What trade-offs do NoSQL databases make? Each tool in this guide illustrates a specific architectural decision—know the concept, and you'll be able to answer any question they throw at you.
Batch processing frameworks handle large, bounded datasets in chunks, which is ideal when you need to analyze historical data and can tolerate some latency. Batch processing trades speed for thoroughness, operating on complete datasets rather than individual records as they arrive.
Compare: Hadoop vs. Spark—both handle distributed batch processing, but Hadoop writes intermediate results to disk while Spark keeps them in memory. If an FRQ asks about processing speed vs. cost trade-offs, this is your go-to comparison.
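To make the in-memory point concrete, here is a minimal PySpark sketch (a local Spark installation and the input path `events.txt` are assumptions for illustration). Calling `cache()` keeps the intermediate result in memory, so repeated passes, as in iterative algorithms, avoid re-reading from disk the way Hadoop MapReduce would between stages.

```python
from pyspark.sql import SparkSession

# Minimal batch sketch: word count with an in-memory cache.
# "events.txt" is a hypothetical input path; a real job would point at HDFS or S3.
spark = SparkSession.builder.appName("batch-wordcount").getOrCreate()

lines = spark.sparkContext.textFile("events.txt")
counts = (
    lines.flatMap(lambda line: line.split())   # split each line into words
         .map(lambda word: (word, 1))          # pair each word with a count of 1
         .reduceByKey(lambda a, b: a + b)      # sum counts per word across partitions
         .cache()                              # keep the result in memory for reuse
)

# Both actions below reuse the cached result instead of recomputing from disk.
print(counts.count())
print(counts.take(5))

spark.stop()
```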
Stream processing handles unbounded data—continuous flows of events that need real-time or near-real-time analysis. These frameworks process records as they arrive rather than waiting for complete datasets.
Compare: Flink vs. Storm—both handle real-time streams, but Flink supports exactly-once semantics and event time processing natively, while Storm traditionally offered at-least-once guarantees. Flink is generally preferred for complex stateful applications.
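Framework APIs differ, but the core idea, processing each record the moment it arrives and aggregating over windows, can be sketched in plain Python. The event tuples and the 10-second tumbling window below are illustrative assumptions, not any specific framework's API.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # illustrative tumbling-window size

def window_start(event_time: float) -> int:
    """Map an event timestamp onto the start of its 10-second window."""
    return int(event_time // WINDOW_SECONDS) * WINDOW_SECONDS

def process_stream(events):
    """Count events per (window, sensor) as each record arrives; no waiting for a full dataset."""
    counts = defaultdict(int)
    for event_time, sensor_id in events:           # records processed one at a time
        key = (window_start(event_time), sensor_id)
        counts[key] += 1
        yield key, counts[key]                     # emit an updated result immediately

# Hypothetical (timestamp, sensor_id) records; in practice these arrive from a broker.
stream = [(100.2, "s1"), (103.7, "s2"), (109.9, "s1"), (112.4, "s1")]
for (window, sensor), count in process_stream(stream):
    print(f"window starting at {window}s: {sensor} -> {count} events")
```

Real stream frameworks add the hard parts this sketch ignores: distributing the state across machines, handling late or out-of-order events, and guaranteeing exactly-once results after failures.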
Before you can process data, you need to collect and distribute it reliably. Message brokers such as Kafka decouple data producers from consumers, enabling scalable and fault-tolerant data pipelines.
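Here is a minimal sketch of that decoupling, assuming the third-party kafka-python client and a broker at `localhost:9092` (both assumptions): the producer writes to a topic without knowing who will read it, and the consumer reads independently, possibly much later.

```python
import json
from kafka import KafkaProducer, KafkaConsumer  # assumes the kafka-python package

# Producer: writes sensor readings to a topic; it never needs to know who reads them.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",                      # assumed local broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("sensor-readings", {"sensor": "s1", "temp": 21.5})
producer.flush()

# Consumer: reads from the same topic on its own schedule.
consumer = KafkaConsumer(
    "sensor-readings",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    auto_offset_reset="earliest",   # start from the beginning of the topic
    consumer_timeout_ms=5000,       # stop iterating after 5s with no new messages
)
for message in consumer:
    print(message.value)            # {'sensor': 's1', 'temp': 21.5}
```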
Query-layer tools such as Hive and Pig provide familiar interfaces (like SQL) for querying big data systems, bridging the gap between traditional database skills and distributed processing.
Compare: Hive vs. Pig—Hive uses SQL-like syntax familiar to database users, while Pig uses a procedural scripting language. Hive is better for ad-hoc queries; Pig excels at complex multi-step transformations.
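One way to see the "existing SQL skills carry over" point is through a standard DB-API client. This sketch assumes the PyHive package, a HiveServer2 instance on `localhost:10000`, and a hypothetical `page_views` table; the query reads like ordinary SQL even though it compiles to distributed jobs over data in HDFS.

```python
from pyhive import hive  # assumes the PyHive package and a running HiveServer2

# Standard DB-API usage: connect, get a cursor, execute SQL, fetch results.
conn = hive.connect(host="localhost", port=10000)  # assumed local HiveServer2
cursor = conn.cursor()

# HiveQL looks like ordinary SQL; Hive translates it into distributed batch jobs.
cursor.execute(
    "SELECT page, COUNT(*) AS views "
    "FROM page_views "                 # hypothetical table
    "GROUP BY page "
    "ORDER BY views DESC "
    "LIMIT 10"
)
for page, views in cursor.fetchall():
    print(page, views)

conn.close()
```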
NoSQL databases sacrifice some traditional database guarantees for scalability and flexibility. They're designed for distributed environments where relational databases struggle with volume, velocity, or variety of data.
Compare: MongoDB vs. Cassandra—MongoDB offers richer queries and flexible documents, while Cassandra provides better write performance and availability. MongoDB suits varied query patterns; Cassandra excels at high-volume writes with simple lookups.
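To ground the "flexible documents, rich queries" side of that comparison, here is a minimal pymongo sketch; the connection string, database, and collection names are illustrative assumptions.

```python
from pymongo import MongoClient  # assumes the pymongo package

client = MongoClient("mongodb://localhost:27017")   # assumed local server
readings = client["iot"]["readings"]                # hypothetical database/collection

# Documents in one collection can have different shapes (schema flexibility).
readings.insert_one({"sensor": "s1", "temp": 21.5, "tags": ["lab", "indoor"]})
readings.insert_one({"sensor": "s2", "humidity": 40})

# Rich ad-hoc queries: filter on any field without predefining a schema.
for doc in readings.find({"temp": {"$gt": 20}}):
    print(doc)
```

Cassandra, by contrast, expects you to design tables around known query patterns up front in exchange for very fast, highly available writes.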
When you need to search through massive datasets in milliseconds, specialized search engines such as Elasticsearch optimize for query speed over write efficiency.
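A minimal full-text search sketch follows, assuming the official Elasticsearch Python client with 8.x-style keyword arguments, a local node, and a hypothetical `articles` index. Indexing does the expensive work up front (analyzing text into an inverted index) so that searches return in milliseconds.

```python
from elasticsearch import Elasticsearch  # assumes the official client, 8.x-style API

es = Elasticsearch("http://localhost:9200")  # assumed local node

# Indexing is the expensive step: the text is analyzed and added to an inverted index.
es.index(
    index="articles", id="1",
    document={"title": "Stream processing basics", "body": "Flink handles unbounded data"},
)
es.indices.refresh(index="articles")  # make the document searchable immediately

# Searching is the cheap step: the inverted index returns full-text matches quickly.
result = es.search(index="articles", query={"match": {"body": "unbounded"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])
```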
| Concept | Best Examples |
|---|---|
| Batch Processing | Hadoop (disk-based), Spark (in-memory) |
| Stream Processing | Flink (stateful), Storm (topology-based) |
| Message Brokering | Kafka |
| SQL on Big Data | Hive (SQL-like), Pig (scripting) |
| Document Storage | MongoDB |
| High-Availability NoSQL | Cassandra |
| Full-Text Search | Elasticsearch |
| In-Memory Speed | Spark, Elasticsearch |
| Fault Tolerance | Kafka, Cassandra, HDFS |
1. Which two frameworks both handle distributed batch processing, and what's the key architectural difference that makes one faster for iterative algorithms?
2. If you needed to process a continuous stream of sensor data with complex event patterns and exactly-once guarantees, which tool would you choose and why?
3. Compare MongoDB and Cassandra: What type of workload is each optimized for, and what consistency trade-offs does each make?
4. An FRQ asks you to design a real-time analytics pipeline. Which tool would you use to collect events from multiple sources, and which would you use to make that data searchable in milliseconds?
5. Explain why Hive was created for the Hadoop ecosystem—what problem does it solve for organizations with existing SQL expertise?