
📊 Big Data Analytics and Visualization

Key Big Data Storage Technologies

Why This Matters

When you're working with big data analytics and visualization, the storage layer isn't just a place to dump files—it's the foundation that determines what kinds of analysis you can perform, how fast you can query data, and whether your system can scale as data volumes grow. You're being tested on understanding why different storage technologies exist and when to choose each one. The key principles here include distributed architecture, data modeling approaches (row vs. column), consistency vs. availability tradeoffs, and the distinction between operational and analytical workloads.

Don't just memorize product names and features. Know what problem each technology solves, how its architecture enables specific use cases, and how these tools fit together in a modern data pipeline. When an exam question asks you to design a storage solution or compare approaches, you need to think in terms of access patterns, scalability mechanisms, and data structure requirements—not just brand names.


Distributed File Systems: The Foundation Layer

These technologies provide the underlying storage infrastructure that other big data tools build upon. They break large files into blocks and distribute them across clusters of commodity hardware, trading single-machine simplicity for massive scalability and fault tolerance.

Hadoop Distributed File System (HDFS)

  • Block-based distributed storage—splits files into 128MB blocks (default) and replicates each block across multiple nodes for fault tolerance
  • Write-once, read-many architecture optimized for batch processing workloads rather than real-time updates
  • High throughput over low latency—designed for streaming large datasets sequentially, not random access patterns
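
To make the write-once, read-many pattern concrete, here is a minimal sketch of writing and then streaming back an HDFS file from Python. It assumes pyarrow with libhdfs available and a reachable NameNode; the hostname, port, and paths are placeholders, not part of any standard setup.

```python
# Sketch: writing and reading a file on HDFS with pyarrow (assumes libhdfs is
# installed and the NameNode address below is adjusted for your cluster).
from pyarrow import fs

# Connect to the NameNode; host/port are placeholders for your cluster.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Write once: HDFS splits this file into blocks and replicates each block.
with hdfs.open_output_stream("/data/events/2024-01-01.log") as out:
    out.write(b"event_id,timestamp,value\n1,1704067200,42\n")

# Read many: sequential streaming reads are the access pattern HDFS favors.
with hdfs.open_input_stream("/data/events/2024-01-01.log") as src:
    print(src.read().decode())
```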

Amazon S3

  • Object storage model—stores data as discrete objects with metadata and unique identifiers, not as hierarchical file systems
  • Eleven 9s durability (99.999999999%)—achieved through automatic replication across multiple availability zones
  • Pay-per-use pricing with storage classes (Standard, Glacier, etc.) that let you optimize cost based on access frequency
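
A hedged sketch of the object-storage model using boto3: each object is a key plus bytes plus metadata, and the storage class is set per object. The bucket name and keys are illustrative, and credentials are assumed to come from the environment or an IAM role.

```python
# Sketch: storing and retrieving objects in S3 with boto3 (bucket name and
# keys are illustrative; credentials come from your environment/IAM role).
import boto3

s3 = boto3.client("s3")

# Each object is a key + bytes + metadata, not a file in a directory tree.
s3.put_object(
    Bucket="example-analytics-lake",          # placeholder bucket
    Key="raw/events/2024/01/01/events.json",  # "folders" are just key prefixes
    Body=b'{"event_id": 1, "value": 42}',
    StorageClass="STANDARD_IA",               # cheaper class for infrequent access
)

# Retrieval is by exact key; there is no random access inside an object.
obj = s3.get_object(Bucket="example-analytics-lake",
                    Key="raw/events/2024/01/01/events.json")
print(obj["Body"].read())
```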

Google Cloud Storage

  • Unified object storage with automatic data redundancy across geographic regions for disaster recovery
  • Storage class tiering—Standard, Nearline, Coldline, and Archive classes match cost to access patterns (hot data vs. cold data)
  • Native integration with BigQuery, Dataflow, and other GCP analytics services for seamless data pipelines
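
The same idea in Google Cloud Storage with the google-cloud-storage client. The bucket and object names are assumptions, and authentication is expected to come from application default credentials.

```python
# Sketch: uploading to and reading from Google Cloud Storage
# (bucket/object names are placeholders; uses application default credentials).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-analytics-lake-gcs")  # placeholder bucket

# Upload an object; redundancy across zones/regions is handled by the service.
blob = bucket.blob("raw/events/2024/01/01/events.json")
blob.upload_from_string('{"event_id": 1, "value": 42}',
                        content_type="application/json")

# Download it back; from here it can feed BigQuery, Dataflow, etc.
print(blob.download_as_bytes())
```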

Compare: HDFS vs. S3—both provide distributed, fault-tolerant storage, but HDFS is designed for on-premise Hadoop clusters with high-throughput batch processing, while S3 offers managed cloud storage with flexible access patterns and pay-as-you-go economics. If asked to design a cloud-native data lake, S3 or GCS is your answer; for traditional Hadoop workloads, HDFS remains the standard.


NoSQL Databases: Flexible Data Models at Scale

When relational databases hit their limits, NoSQL databases offer alternative data models optimized for specific access patterns. The CAP theorem governs these systems: when a network partition occurs, a distributed database must choose between consistency and availability, and because partitions are unavoidable at scale, every design commits to one side of that tradeoff.

Apache HBase

  • Wide-column store built on HDFS that provides random, real-time read/write access to big data
  • Strong consistency model—guarantees that reads return the most recent write, unlike eventually consistent systems
  • Sparse data optimization—stores only non-null values, making it ideal for datasets with many empty fields like time-series or sensor data
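
A minimal sketch of HBase's wide-column, sparse row model using the happybase client. It assumes an HBase Thrift server is reachable at the host shown; the table, column family, and row key design are placeholders.

```python
# Sketch: random, real-time reads/writes against HBase via happybase
# (assumes an HBase Thrift server at the host below; names are placeholders).
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("sensor_readings")

# Row key encodes sensor + timestamp; only non-null cells are stored,
# so sparse sensors cost nothing for the fields they never report.
table.put(b"sensor42#1704067200", {
    b"metrics:temperature": b"21.5",
    b"metrics:humidity": b"40",
})

# Strongly consistent read: returns the most recent write for this row.
print(table.row(b"sensor42#1704067200"))
```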

Apache Cassandra

  • Peer-to-peer architecture with no single point of failure—every node can handle read/write requests
  • Tunable consistency—lets you choose consistency level per query, balancing between strong consistency and high availability
  • Linear scalability—adding nodes increases throughput proportionally, designed for write-heavy workloads across data centers
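
A sketch of per-query tunable consistency with the DataStax cassandra-driver: a fast write acknowledged by one replica, then a quorum read. The contact points, keyspace, and table are assumptions.

```python
# Sketch: Cassandra's tunable consistency with the cassandra-driver
# (contact points, keyspace, and table names are placeholders).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["cassandra-node1", "cassandra-node2"])  # any node can coordinate
session = cluster.connect("metrics")                       # placeholder keyspace

# Fast, highly available write: only one replica must acknowledge.
insert = SimpleStatement(
    "INSERT INTO sensor_readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, ("sensor42", 1704067200, 21.5))

# Stronger read: a quorum of replicas must agree before returning.
select = SimpleStatement(
    "SELECT value FROM sensor_readings WHERE sensor_id = %s AND ts = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(select, ("sensor42", 1704067200)).one())
```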

MongoDB

  • Document-oriented model—stores data in flexible BSON (binary JSON) documents that can vary in structure
  • Rich query language with support for ad-hoc queries, indexing, and aggregation pipelines for complex analytics
  • Horizontal scaling via sharding—automatically distributes data across shards based on a shard key you define
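
A sketch of flexible documents plus an aggregation pipeline with pymongo. The connection string, database, collection, and field names are placeholders.

```python
# Sketch: flexible documents plus an aggregation pipeline in MongoDB
# (connection string, database, and collection names are placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents in the same collection can vary in structure (schema flexibility).
events.insert_many([
    {"user": "alice", "action": "click", "value": 3},
    {"user": "bob", "action": "click", "value": 7, "campaign": "spring"},
])

# Aggregation pipeline: group clicks per user and sum their values.
pipeline = [
    {"$match": {"action": "click"}},
    {"$group": {"_id": "$user", "total": {"$sum": "$value"}}},
]
for row in events.aggregate(pipeline):
    print(row)
```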

Compare: HBase vs. Cassandra—both are wide-column stores designed for massive scale, but HBase prioritizes strong consistency and integrates tightly with Hadoop, while Cassandra prioritizes availability and partition tolerance with its masterless architecture. Choose HBase when you need HDFS integration and consistency; choose Cassandra for global distribution and write-heavy workloads.

Compare: MongoDB vs. Cassandra—MongoDB excels at flexible querying on semi-structured documents with its rich query language, while Cassandra excels at high-volume writes with predictable performance. If your use case requires complex queries, lean MongoDB; if you need extreme write throughput, lean Cassandra.


Data Warehouse and Query Layers

These technologies sit above raw storage to provide SQL-like access and analytical capabilities. They translate familiar query languages into distributed processing jobs, making big data accessible to analysts who know SQL.

Apache Hive

  • SQL-on-Hadoop—translates HiveQL queries into MapReduce, Tez, or Spark jobs for distributed execution
  • Schema-on-read approach—applies structure to data at query time rather than requiring predefined schemas at load time
  • Partitioning and bucketing optimize query performance by organizing data into directories based on column values
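
A sketch of schema-on-read and partitioning in HiveQL, submitted through the PyHive client. The HiveServer2 host, HDFS location, and table layout are assumptions, and the query only illustrates partition pruning.

```python
# Sketch: schema-on-read and partitioning in Hive via PyHive
# (HiveServer2 host, paths, and table names are placeholders).
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="default")
cur = conn.cursor()

# Schema-on-read: the table definition is just a projection over files
# already sitting in HDFS; nothing is rewritten at load time.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        event_id BIGINT,
        user_id  STRING,
        value    DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/data/events'
""")

# Partition pruning: only the 2024-01-01 directory needs to be scanned.
cur.execute("SELECT COUNT(*) FROM events WHERE event_date = '2024-01-01'")
print(cur.fetchall())
```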

Apache Kudu

  • Hybrid storage engine combining columnar storage for analytics with fast row-level updates for operational workloads
  • Low-latency analytics—integrates with Impala for sub-second SQL queries on data that's still being updated
  • Fills the gap between HDFS (batch-optimized) and HBase (random-access-optimized) for use cases needing both
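
A sketch of real-time analytics on mutable data: a Kudu-backed table created and updated through Impala using the impyla client. The Impala daemon host and the table definition are assumptions.

```python
# Sketch: real-time analytics on mutable data with Kudu via Impala (impyla).
# The Impala host and table definition below are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon-host", port=21050)
cur = conn.cursor()

# Kudu tables need a primary key; Impala provides the SQL front end.
cur.execute("""
    CREATE TABLE IF NOT EXISTS device_state (
        device_id STRING,
        status    STRING,
        updated   BIGINT,
        PRIMARY KEY (device_id)
    )
    PARTITION BY HASH (device_id) PARTITIONS 4
    STORED AS KUDU
""")

# Row-level upsert: something append-only HDFS tables can't do in place.
cur.execute("UPSERT INTO device_state VALUES ('dev-1', 'online', 1704067200)")

# Analytical query over data that is still being updated.
cur.execute("SELECT status, COUNT(*) FROM device_state GROUP BY status")
print(cur.fetchall())
```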

Compare: Hive vs. Kudu—Hive provides SQL access to static data in HDFS with batch-oriented performance, while Kudu enables real-time analytics on mutable data. If your data is append-only and you're running scheduled reports, use Hive; if you need to query data that's constantly updating, Kudu is your tool.


File Formats: How Data Gets Serialized

The format you choose for storing data dramatically affects query performance, storage costs, and processing efficiency. Row-oriented formats optimize for writing complete records; column-oriented formats optimize for reading specific fields across many records.

Apache Parquet

  • Columnar format—stores data by column rather than row, enabling queries to read only the columns they need
  • Efficient compression—similar values stored together in a column compress far better than the mixed values of a row, often shrinking files to a fraction of their raw size
  • Nested data support—handles complex, hierarchical data structures common in JSON-like documents
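
A sketch of column pruning with pyarrow: write a small Parquet file, then read back only the one column a query needs. The file path and column names are illustrative.

```python
# Sketch: column pruning with Parquet via pyarrow (paths/columns illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": ["alice", "bob", "alice"],
    "action":  ["click", "view", "click"],
    "value":   [3, 9, 7],
})

# Columnar layout with per-column encoding and compression on disk.
pq.write_table(table, "events.parquet", compression="snappy")

# Analytical read: pull only the column the query needs.
values_only = pq.read_table("events.parquet", columns=["value"])
print(values_only.to_pydict())
```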

Apache Avro

  • Row-oriented format optimized for write-heavy workloads and complete record serialization
  • Schema evolution—supports adding, removing, or modifying fields without breaking compatibility with existing data
  • Self-describing format—embeds schema with data, making it ideal for data exchange between different systems
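
A sketch of Avro's embedded schema and schema evolution using fastavro: the schema travels in the file header, and the new field carries a default so older data stays readable. The record schema and field names are assumptions.

```python
# Sketch: row-oriented Avro serialization and schema evolution with fastavro
# (schema and field names are illustrative).
from fastavro import parse_schema, reader, writer

schema_v2 = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "long"},
        {"name": "value", "type": "double"},
        # New field with a default: records written without it remain compatible.
        {"name": "source", "type": "string", "default": "unknown"},
    ],
})

records = [{"event_id": 1, "value": 42.0, "source": "sensor"},
           {"event_id": 2, "value": 7.5, "source": "api"}]

# The schema is written into the file header, so the data is self-describing.
with open("events.avro", "wb") as out:
    writer(out, schema_v2, records)

with open("events.avro", "rb") as src:
    for rec in reader(src):
        print(rec)
```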

Compare: Parquet vs. Avro—Parquet excels at analytical queries that scan specific columns across millions of rows (think aggregations and filters), while Avro excels at data serialization and streaming where you process complete records. Use Parquet for your data warehouse; use Avro for Kafka message streams and ETL pipelines.


Quick Reference Table

Concept | Best Examples
Distributed file storage | HDFS, Amazon S3, Google Cloud Storage
Strong consistency NoSQL | HBase, MongoDB
High availability NoSQL | Cassandra
Document-oriented storage | MongoDB
Wide-column stores | HBase, Cassandra
SQL-on-Hadoop | Hive
Real-time analytics | Kudu, HBase
Columnar file format | Parquet
Row-oriented serialization | Avro
Schema evolution support | Avro, Parquet
Cloud-native object storage | Amazon S3, Google Cloud Storage

Self-Check Questions

  1. Which two technologies would you compare when designing a system that needs both real-time updates and analytical queries on the same dataset? What tradeoffs does each represent?

  2. If you're building a data pipeline where upstream schemas might change over time, which file format should you choose and why? How does it handle schema evolution differently than alternatives?

  3. Compare and contrast HBase and Cassandra in terms of their consistency models and architecture. In what scenario would you choose one over the other?

  4. A data engineer needs to optimize storage costs for a data lake while maintaining fast analytical query performance. Which file format and storage class combination would you recommend, and what principles guide this choice?

  5. You're asked to design storage for a system that receives millions of writes per second across global data centers with no tolerance for downtime. Which technology best fits this requirement, and what architectural feature makes it suitable?