
📊 Big Data Analytics and Visualization

Key Big Data Storage Technologies

Why This Matters

When you're working with big data analytics and visualization, the storage layer isn't just a place to dump files—it's the foundation that determines what kinds of analysis you can perform, how fast you can query data, and whether your system can scale as data volumes grow. You're being tested on understanding why different storage technologies exist and when to choose each one. The key principles here include distributed architecture, data modeling approaches (row vs. column), consistency vs. availability tradeoffs, and the distinction between operational and analytical workloads.

Don't just memorize product names and features. Know what problem each technology solves, how its architecture enables specific use cases, and how these tools fit together in a modern data pipeline. When an exam question asks you to design a storage solution or compare approaches, you need to think in terms of access patterns, scalability mechanisms, and data structure requirements—not just brand names.


Distributed File Systems: The Foundation Layer

These technologies provide the underlying storage infrastructure that other big data tools build upon. They break large files into blocks and distribute them across clusters of commodity hardware, trading single-machine simplicity for massive scalability and fault tolerance.

Hadoop Distributed File System (HDFS)

  • Block-based distributed storage—splits files into 128MB blocks (default) and replicates each block across multiple nodes for fault tolerance
  • Write-once, read-many architecture optimized for batch processing workloads rather than real-time updates
  • High throughput over low latency—designed for streaming large datasets sequentially, not random access patterns
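
To make the write-once, read-many pattern concrete, here is a minimal sketch of writing and then streaming back an HDFS file from Python. It assumes pyarrow with libhdfs available and a reachable NameNode; the hostname, port, and paths are placeholders, not part of any standard setup.

```python
# Sketch: writing and reading a file on HDFS with pyarrow (assumes libhdfs is
# installed and the NameNode address below is adjusted for your cluster).
from pyarrow import fs

# Connect to the NameNode; host/port are placeholders for your cluster.
hdfs = fs.HadoopFileSystem(host="namenode", port=8020)

# Write once: HDFS splits this file into blocks and replicates each block.
with hdfs.open_output_stream("/data/events/2024-01-01.log") as out:
    out.write(b"event_id,timestamp,value\n1,1704067200,42\n")

# Read many: sequential streaming reads are the access pattern HDFS favors.
with hdfs.open_input_stream("/data/events/2024-01-01.log") as src:
    print(src.read().decode())
```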

Amazon S3

  • Object storage model—stores data as discrete objects with metadata and unique identifiers, not as hierarchical file systems
  • Eleven 9s durability (99.999999999%)—achieved through automatic replication across multiple availability zones
  • Pay-per-use pricing with storage classes (Standard, Glacier, etc.) that let you optimize cost based on access frequency
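
A hedged sketch of the object-storage model using boto3: each object is a key plus bytes plus metadata, and the storage class is set per object. The bucket name and keys are illustrative, and credentials are assumed to come from the environment or an IAM role.

```python
# Sketch: storing and retrieving objects in S3 with boto3 (bucket name and
# keys are illustrative; credentials come from your environment/IAM role).
import boto3

s3 = boto3.client("s3")

# Each object is a key + bytes + metadata, not a file in a directory tree.
s3.put_object(
    Bucket="example-analytics-lake",          # placeholder bucket
    Key="raw/events/2024/01/01/events.json",  # "folders" are just key prefixes
    Body=b'{"event_id": 1, "value": 42}',
    StorageClass="STANDARD_IA",               # cheaper class for infrequent access
)

# Retrieval is by exact key; there is no random access inside an object.
obj = s3.get_object(Bucket="example-analytics-lake",
                    Key="raw/events/2024/01/01/events.json")
print(obj["Body"].read())
```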

Google Cloud Storage

  • Unified object storage with automatic data redundancy across geographic regions for disaster recovery
  • Storage class tiering—Standard, Nearline, Coldline, and Archive classes match cost to access patterns (hot data vs. cold data)
  • Native integration with BigQuery, Dataflow, and other GCP analytics services for seamless data pipelines
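
The same idea in Google Cloud Storage with the google-cloud-storage client. The bucket and object names are assumptions, and authentication is expected to come from application default credentials.

```python
# Sketch: uploading to and reading from Google Cloud Storage
# (bucket/object names are placeholders; uses application default credentials).
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("example-analytics-lake-gcs")  # placeholder bucket

# Upload an object; redundancy across zones/regions is handled by the service.
blob = bucket.blob("raw/events/2024/01/01/events.json")
blob.upload_from_string('{"event_id": 1, "value": 42}',
                        content_type="application/json")

# Download it back; from here it can feed BigQuery, Dataflow, etc.
print(blob.download_as_bytes())
```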

Compare: HDFS vs. S3—both provide distributed, fault-tolerant storage, but HDFS is designed for on-premise Hadoop clusters with high-throughput batch processing, while S3 offers managed cloud storage with flexible access patterns and pay-as-you-go economics. If asked to design a cloud-native data lake, S3 or GCS is your answer; for traditional Hadoop workloads, HDFS remains the standard.


NoSQL Databases: Flexible Data Models at Scale

When relational databases hit their limits, NoSQL databases offer alternative data models optimized for specific access patterns. The CAP theorem governs these systems: when a network partition occurs, a distributed database must choose between consistency and availability, and because partitions are unavoidable at scale, every design commits to one side of that tradeoff.

Apache HBase

  • Wide-column store built on HDFS that provides random, real-time read/write access to big data
  • Strong consistency model—guarantees that reads return the most recent write, unlike eventually consistent systems
  • Sparse data optimization—stores only non-null values, making it ideal for datasets with many empty fields like time-series or sensor data
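
A minimal sketch of HBase's wide-column, sparse row model using the happybase client. It assumes an HBase Thrift server is reachable at the host shown; the table, column family, and row key design are placeholders.

```python
# Sketch: random, real-time reads/writes against HBase via happybase
# (assumes an HBase Thrift server at the host below; names are placeholders).
import happybase

connection = happybase.Connection("hbase-thrift-host")
table = connection.table("sensor_readings")

# Row key encodes sensor + timestamp; only non-null cells are stored,
# so sparse sensors cost nothing for the fields they never report.
table.put(b"sensor42#1704067200", {
    b"metrics:temperature": b"21.5",
    b"metrics:humidity": b"40",
})

# Strongly consistent read: returns the most recent write for this row.
print(table.row(b"sensor42#1704067200"))
```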

Apache Cassandra

  • Peer-to-peer architecture with no single point of failure—every node can handle read/write requests
  • Tunable consistency—lets you choose consistency level per query, balancing between strong consistency and high availability
  • Linear scalability—adding nodes increases throughput proportionally, designed for write-heavy workloads across data centers
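
A sketch of per-query tunable consistency with the DataStax cassandra-driver: a fast write acknowledged by one replica, then a quorum read. The contact points, keyspace, and table are assumptions.

```python
# Sketch: Cassandra's tunable consistency with the cassandra-driver
# (contact points, keyspace, and table names are placeholders).
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["cassandra-node1", "cassandra-node2"])  # any node can coordinate
session = cluster.connect("metrics")                       # placeholder keyspace

# Fast, highly available write: only one replica must acknowledge.
insert = SimpleStatement(
    "INSERT INTO sensor_readings (sensor_id, ts, value) VALUES (%s, %s, %s)",
    consistency_level=ConsistencyLevel.ONE,
)
session.execute(insert, ("sensor42", 1704067200, 21.5))

# Stronger read: a quorum of replicas must agree before returning.
select = SimpleStatement(
    "SELECT value FROM sensor_readings WHERE sensor_id = %s AND ts = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
print(session.execute(select, ("sensor42", 1704067200)).one())
```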

MongoDB

  • Document-oriented model—stores data in flexible BSON (binary JSON) documents that can vary in structure
  • Rich query language with support for ad-hoc queries, indexing, and aggregation pipelines for complex analytics
  • Horizontal scaling via sharding—automatically distributes data across shards based on a shard key you define
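
A sketch of flexible documents plus an aggregation pipeline with pymongo. The connection string, database, collection, and field names are placeholders.

```python
# Sketch: flexible documents plus an aggregation pipeline in MongoDB
# (connection string, database, and collection names are placeholders).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client["analytics"]["events"]

# Documents in the same collection can vary in structure (schema flexibility).
events.insert_many([
    {"user": "alice", "action": "click", "value": 3},
    {"user": "bob", "action": "click", "value": 7, "campaign": "spring"},
])

# Aggregation pipeline: group clicks per user and sum their values.
pipeline = [
    {"$match": {"action": "click"}},
    {"$group": {"_id": "$user", "total": {"$sum": "$value"}}},
]
for row in events.aggregate(pipeline):
    print(row)
```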

Compare: HBase vs. Cassandra—both are wide-column stores designed for massive scale, but HBase prioritizes strong consistency and integrates tightly with Hadoop, while Cassandra prioritizes availability and partition tolerance with its masterless architecture. Choose HBase when you need HDFS integration and consistency; choose Cassandra for global distribution and write-heavy workloads.

Compare: MongoDB vs. Cassandra—MongoDB excels at flexible querying on semi-structured documents with its rich query language, while Cassandra excels at high-volume writes with predictable performance. If your use case requires complex queries, lean MongoDB; if you need extreme write throughput, lean Cassandra.


Data Warehouse and Query Layers

These technologies sit above raw storage to provide SQL-like access and analytical capabilities. They translate familiar query languages into distributed processing jobs, making big data accessible to analysts who know SQL.

Apache Hive

  • SQL-on-Hadoop—translates HiveQL queries into MapReduce, Tez, or Spark jobs for distributed execution
  • Schema-on-read approach—applies structure to data at query time rather than requiring predefined schemas at load time
  • Partitioning and bucketing optimize query performance by organizing data into directories based on column values
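
A sketch of schema-on-read and partitioning in HiveQL, submitted through the PyHive client. The HiveServer2 host, HDFS location, and table layout are assumptions, and the query only illustrates partition pruning.

```python
# Sketch: schema-on-read and partitioning in Hive via PyHive
# (HiveServer2 host, paths, and table names are placeholders).
from pyhive import hive

conn = hive.Connection(host="hiveserver2-host", port=10000, database="default")
cur = conn.cursor()

# Schema-on-read: the table definition is just a projection over files
# already sitting in HDFS; nothing is rewritten at load time.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS events (
        event_id BIGINT,
        user_id  STRING,
        value    DOUBLE
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION '/data/events'
""")

# Partition pruning: only the 2024-01-01 directory needs to be scanned.
cur.execute("SELECT COUNT(*) FROM events WHERE event_date = '2024-01-01'")
print(cur.fetchall())
```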

Apache Kudu

  • Hybrid storage engine combining columnar storage for analytics with fast row-level updates for operational workloads
  • Low-latency analytics—integrates with Impala for sub-second SQL queries on data that's still being updated
  • Fills the gap between HDFS (batch-optimized) and HBase (random-access-optimized) for use cases needing both
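
A sketch of real-time analytics on mutable data: a Kudu-backed table created and updated through Impala using the impyla client. The Impala daemon host and the table definition are assumptions.

```python
# Sketch: real-time analytics on mutable data with Kudu via Impala (impyla).
# The Impala host and table definition below are placeholders.
from impala.dbapi import connect

conn = connect(host="impala-daemon-host", port=21050)
cur = conn.cursor()

# Kudu tables need a primary key; Impala provides the SQL front end.
cur.execute("""
    CREATE TABLE IF NOT EXISTS device_state (
        device_id STRING,
        status    STRING,
        updated   BIGINT,
        PRIMARY KEY (device_id)
    )
    PARTITION BY HASH (device_id) PARTITIONS 4
    STORED AS KUDU
""")

# Row-level upsert: something append-only HDFS tables can't do in place.
cur.execute("UPSERT INTO device_state VALUES ('dev-1', 'online', 1704067200)")

# Analytical query over data that is still being updated.
cur.execute("SELECT status, COUNT(*) FROM device_state GROUP BY status")
print(cur.fetchall())
```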

Compare: Hive vs. Kudu—Hive provides SQL access to static data in HDFS with batch-oriented performance, while Kudu enables real-time analytics on mutable data. If your data is append-only and you're running scheduled reports, use Hive; if you need to query data that's constantly updating, Kudu is your tool.


File Formats: How Data Gets Serialized

The format you choose for storing data dramatically affects query performance, storage costs, and processing efficiency. Row-oriented formats optimize for writing complete records; column-oriented formats optimize for reading specific fields across many records.

Apache Parquet

  • Columnar format—stores data by column rather than row, enabling queries to read only the columns they need
  • Efficient compression—similar values stored together in a column compress far better than the mixed values of a row, often shrinking files to a fraction of their raw size
  • Nested data support—handles complex, hierarchical data structures common in JSON-like documents
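
A sketch of column pruning with pyarrow: write a small Parquet file, then read back only the one column a query needs. The file path and column names are illustrative.

```python
# Sketch: column pruning with Parquet via pyarrow (paths/columns illustrative).
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "user_id": ["alice", "bob", "alice"],
    "action":  ["click", "view", "click"],
    "value":   [3, 9, 7],
})

# Columnar layout with per-column encoding and compression on disk.
pq.write_table(table, "events.parquet", compression="snappy")

# Analytical read: pull only the column the query needs.
values_only = pq.read_table("events.parquet", columns=["value"])
print(values_only.to_pydict())
```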

Apache Avro

  • Row-oriented format optimized for write-heavy workloads and complete record serialization
  • Schema evolution—supports adding, removing, or modifying fields without breaking compatibility with existing data
  • Self-describing format—embeds schema with data, making it ideal for data exchange between different systems
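
A sketch of Avro's embedded schema and schema evolution using fastavro: the schema travels in the file header, and the new field carries a default so older data stays readable. The record schema and field names are assumptions.

```python
# Sketch: row-oriented Avro serialization and schema evolution with fastavro
# (schema and field names are illustrative).
from fastavro import parse_schema, reader, writer

schema_v2 = parse_schema({
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "event_id", "type": "long"},
        {"name": "value", "type": "double"},
        # New field with a default: records written without it remain compatible.
        {"name": "source", "type": "string", "default": "unknown"},
    ],
})

records = [{"event_id": 1, "value": 42.0, "source": "sensor"},
           {"event_id": 2, "value": 7.5, "source": "api"}]

# The schema is written into the file header, so the data is self-describing.
with open("events.avro", "wb") as out:
    writer(out, schema_v2, records)

with open("events.avro", "rb") as src:
    for rec in reader(src):
        print(rec)
```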

Compare: Parquet vs. Avro—Parquet excels at analytical queries that scan specific columns across millions of rows (think aggregations and filters), while Avro excels at data serialization and streaming where you process complete records. Use Parquet for your data warehouse; use Avro for Kafka message streams and ETL pipelines.


Quick Reference Table

Concept | Best Examples
Distributed file storage | HDFS, Amazon S3, Google Cloud Storage
Strong consistency NoSQL | HBase, MongoDB
High availability NoSQL | Cassandra
Document-oriented storage | MongoDB
Wide-column stores | HBase, Cassandra
SQL-on-Hadoop | Hive
Real-time analytics | Kudu, HBase
Columnar file format | Parquet
Row-oriented serialization | Avro
Schema evolution support | Avro, Parquet
Cloud-native object storage | Amazon S3, Google Cloud Storage

Self-Check Questions

  1. Which two technologies would you compare when designing a system that needs both real-time updates and analytical queries on the same dataset? What tradeoffs does each represent?

  2. If you're building a data pipeline where upstream schemas might change over time, which file format should you choose and why? How does it handle schema evolution differently than alternatives?

  3. Compare and contrast HBase and Cassandra in terms of their consistency models and architecture. In what scenario would you choose one over the other?

  4. A data engineer needs to optimize storage costs for a data lake while maintaining fast analytical query performance. Which file format and storage class combination would you recommend, and what principles guide this choice?

  5. You're asked to design storage for a system that receives millions of writes per second across global data centers with no tolerance for downtime. Which technology best fits this requirement, and what architectural feature makes it suitable?