When you're working with big data analytics and visualization, the storage layer isn't just a place to dump files—it's the foundation that determines what kinds of analysis you can perform, how fast you can query data, and whether your system can scale as data volumes grow. You're being tested on understanding why different storage technologies exist and when to choose each one. The key principles here include distributed architecture, data modeling approaches (row vs. column), consistency vs. availability tradeoffs, and the distinction between operational and analytical workloads.
Don't just memorize product names and features. Know what problem each technology solves, how its architecture enables specific use cases, and how these tools fit together in a modern data pipeline. When an exam question asks you to design a storage solution or compare approaches, you need to think in terms of access patterns, scalability mechanisms, and data structure requirements—not just brand names.
These technologies provide the underlying storage infrastructure that other big data tools build upon. They break large files into blocks and distribute them across clusters of commodity hardware, trading single-machine simplicity for massive scalability and fault tolerance.
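As a rough, conceptual sketch of the block-splitting idea (not any particular system's API), the snippet below chunks a local file into fixed-size blocks the way HDFS conceptually divides files before distributing and replicating them; the block size and example file path are placeholders.

```python
# Conceptual sketch only: how a distributed file system splits a file into
# fixed-size blocks before scattering them across data nodes.
# The 128 MB block size mirrors the common HDFS default; the path is a placeholder.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def split_into_blocks(path, block_size=BLOCK_SIZE):
    """Yield (block_index, byte_count) for each block a file would occupy."""
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            yield index, len(chunk)
            index += 1

# A 300 MB file would yield two full 128 MB blocks plus a 44 MB tail block,
# each of which a system like HDFS would replicate (typically 3x) across machines.
```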
Compare: HDFS vs. S3—both provide distributed, fault-tolerant storage, but HDFS is designed for on-premise Hadoop clusters with high-throughput batch processing, while S3 offers managed cloud storage with flexible access patterns and pay-as-you-go economics. If asked to design a cloud-native data lake, S3 or GCS is your answer; for traditional Hadoop workloads, HDFS remains the standard.
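To make the cloud-object-storage side concrete, here is a minimal boto3 sketch that writes and reads one object in S3. The bucket name and key are placeholders, and credentials are assumed to be configured in the environment.

```python
# Minimal S3 object-storage sketch using boto3.
# Bucket name and key are placeholders; credentials come from the environment.
import boto3

s3 = boto3.client("s3")

# Write a small object to the data lake. Objects are immutable blobs keyed by name;
# key prefixes ("raw/events/...") stand in for directories.
s3.put_object(
    Bucket="example-data-lake",
    Key="raw/events/2024/01/events.json",
    Body=b'{"event": "page_view", "user": 42}',
)

# Read it back. Unlike HDFS, there are no blocks or replicas to manage;
# durability and scaling are handled by the managed service.
obj = s3.get_object(Bucket="example-data-lake", Key="raw/events/2024/01/events.json")
print(obj["Body"].read())
```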
When relational databases hit their limits, NoSQL databases offer alternative data models optimized for specific access patterns. The CAP theorem governs these systems: a distributed database can guarantee at most two of Consistency, Availability, and Partition tolerance, and because network partitions are unavoidable at scale, the practical choice is between consistency and availability when a partition occurs.
Compare: HBase vs. Cassandra—both are wide-column stores designed for massive scale, but HBase prioritizes strong consistency and integrates tightly with Hadoop, while Cassandra prioritizes availability and partition tolerance with its masterless architecture. Choose HBase when you need HDFS integration and consistency; choose Cassandra for global distribution and write-heavy workloads.
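The consistency-versus-availability tradeoff shows up directly in Cassandra's tunable consistency levels. The sketch below uses the DataStax Python driver against a hypothetical `analytics` keyspace and `events` table (contact point, keyspace, and schema are all assumptions): one write at QUORUM favors consistency, one at ONE favors availability and throughput.

```python
# Sketch of Cassandra's tunable consistency using the DataStax Python driver.
# The contact point, `analytics` keyspace, and `events` table are hypothetical.
from datetime import datetime, timezone

from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("analytics")

insert_cql = "INSERT INTO events (user_id, event_time, action) VALUES (%s, %s, %s)"

# QUORUM: a majority of replicas must acknowledge the write. Stronger consistency,
# but the request fails if too many replicas are unreachable during a partition.
quorum_stmt = SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.QUORUM)
session.execute(quorum_stmt, (42, datetime.now(timezone.utc), "page_view"))

# ONE: a single replica's acknowledgment is enough. Maximum availability and write
# throughput, at the risk of other replicas briefly serving stale data.
fast_stmt = SimpleStatement(insert_cql, consistency_level=ConsistencyLevel.ONE)
session.execute(fast_stmt, (42, datetime.now(timezone.utc), "click"))
```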
Compare: MongoDB vs. Cassandra—MongoDB excels at flexible querying on semi-structured documents with its rich query language, while Cassandra excels at high-volume writes with predictable performance. If your use case requires complex queries, lean MongoDB; if you need extreme write throughput, lean Cassandra.
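To illustrate the "rich query language" point, here is a small PyMongo sketch that filters and aggregates a hypothetical `orders` collection; the connection URI, database, and field names are assumptions.

```python
# Sketch of MongoDB's document model and query language with PyMongo.
# The connection URI, database, collection, and fields are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Documents are flexible: nested fields and arrays need no up-front schema.
orders.insert_one({
    "customer": {"id": 42, "region": "EU"},
    "items": [{"sku": "A1", "qty": 2}, {"sku": "B7", "qty": 1}],
    "total": 59.90,
    "status": "shipped",
})

# Ad hoc filtering on nested fields, the kind of query Cassandra's
# partition-key-driven data model cannot serve without redesigning the table.
for doc in orders.find({"customer.region": "EU", "total": {"$gt": 50}}):
    print(doc["_id"], doc["total"])

# Server-side aggregation pipeline: group shipped orders by region and sum totals.
pipeline = [
    {"$match": {"status": "shipped"}},
    {"$group": {"_id": "$customer.region", "revenue": {"$sum": "$total"}}},
]
print(list(orders.aggregate(pipeline)))
```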
These technologies sit above raw storage to provide SQL-like access and analytical capabilities. They translate familiar query languages into distributed processing jobs, making big data accessible to analysts who know SQL.
Compare: Hive vs. Kudu—Hive provides SQL access to static data in HDFS with batch-oriented performance, while Kudu enables real-time analytics on mutable data. If your data is append-only and you're running scheduled reports, use Hive; if you need to query data that's constantly updating, Kudu is your tool.
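As a concrete picture of the SQL-on-Hadoop idea in the two paragraphs above, the sketch below uses PyHive to run a HiveQL aggregation over a hypothetical append-only `page_views` table. The host, port, database, and table are assumptions; Hive compiles the query into a distributed batch job over files in storage, so latency is measured in seconds to minutes, not milliseconds.

```python
# Sketch of SQL-on-Hadoop access with PyHive. The host, port, database, and
# `page_views` table are hypothetical; Hive turns the SQL into a distributed
# batch job over files in HDFS rather than serving it like an OLTP database.
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000, database="web_logs")
cursor = conn.cursor()

# A familiar analytical query over an append-only, partitioned table.
cursor.execute("""
    SELECT country, COUNT(*) AS views
    FROM page_views
    WHERE view_date = '2024-01-15'
    GROUP BY country
    ORDER BY views DESC
    LIMIT 10
""")

for country, views in cursor.fetchall():
    print(country, views)
```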
The format you choose for storing data dramatically affects query performance, storage costs, and processing efficiency. Row-oriented formats optimize for writing complete records; column-oriented formats optimize for reading specific fields across many records.
Compare: Parquet vs. Avro—Parquet excels at analytical queries that scan specific columns across millions of rows (think aggregations and filters), while Avro excels at data serialization and streaming where you process complete records. Use Parquet for your data warehouse; use Avro for Kafka message streams and ETL pipelines.
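To ground the row-versus-column distinction, the sketch below writes the same toy records as Parquet (with pyarrow) and as Avro (with fastavro); the file names and schema are placeholders. Note that the Parquet read pulls back a single column, the access pattern analytical queries exploit, while the Avro read hands back complete records, which suits record-at-a-time streaming and ETL.

```python
# Sketch contrasting columnar (Parquet) and row-oriented (Avro) storage.
# File names and the toy schema are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq
from fastavro import writer, reader, parse_schema

rows = [
    {"user_id": 1, "amount": 19.99, "country": "US"},
    {"user_id": 2, "amount": 5.50, "country": "DE"},
    {"user_id": 3, "amount": 42.00, "country": "US"},
]

# --- Parquet: column-oriented ---
table = pa.table({
    "user_id": [r["user_id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "country": [r["country"] for r in rows],
})
pq.write_table(table, "sales.parquet")

# An analytical query can read only the column it needs and skip the rest of the file.
amounts = pq.read_table("sales.parquet", columns=["amount"])
print(amounts["amount"].to_pylist())

# --- Avro: row-oriented with an embedded schema ---
schema = parse_schema({
    "name": "Sale",
    "type": "record",
    "fields": [
        {"name": "user_id", "type": "int"},
        {"name": "amount", "type": "double"},
        {"name": "country", "type": "string"},
    ],
})
with open("sales.avro", "wb") as out:
    writer(out, schema, rows)

# A consumer (e.g., reading a Kafka topic or an ETL batch) processes complete records;
# the embedded schema is what makes controlled schema evolution possible.
with open("sales.avro", "rb") as fo:
    for record in reader(fo):
        print(record)
```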
| Concept | Best Examples |
|---|---|
| Distributed file storage | HDFS, Amazon S3, Google Cloud Storage |
| Strong consistency NoSQL | HBase, MongoDB |
| High availability NoSQL | Cassandra |
| Document-oriented storage | MongoDB |
| Wide-column stores | HBase, Cassandra |
| SQL-on-Hadoop | Hive |
| Real-time analytics | Kudu, HBase |
| Columnar file format | Parquet |
| Row-oriented serialization | Avro |
| Schema evolution support | Avro, Parquet |
| Cloud-native object storage | Amazon S3, Google Cloud Storage |
1. Which two technologies would you compare when designing a system that needs both real-time updates and analytical queries on the same dataset? What tradeoffs does each represent?
2. If you're building a data pipeline where upstream schemas might change over time, which file format should you choose and why? How does it handle schema evolution differently than alternatives?
3. Compare and contrast HBase and Cassandra in terms of their consistency models and architecture. In what scenario would you choose one over the other?
4. A data engineer needs to optimize storage costs for a data lake while maintaining fast analytical query performance. Which file format and storage class combination would you recommend, and what principles guide this choice?
5. You're asked to design storage for a system that receives millions of writes per second across global data centers with no tolerance for downtime. Which technology best fits this requirement, and what architectural feature makes it suitable?