📊 Big Data Analytics and Visualization

Major Big Data Frameworks


Why This Matters

When you're working with big data analytics and visualization, the framework you choose determines everything: how fast you can process data, whether you can handle real-time streams, and what kind of insights you can extract. You're being tested on understanding why different frameworks exist, when to use each one, and how they solve fundamentally different problems. The key concepts here include batch vs. stream processing, storage architecture, fault tolerance, and query optimization.

Don't fall into the trap of memorizing framework names and features in isolation. Instead, focus on the underlying principles: What problem does each framework solve? How does its architecture enable that solution? When would you choose one over another? These comparative questions are exactly what you'll face on exams and in real-world data engineering decisions.


Batch Processing Foundations

These frameworks established the core paradigms for processing massive datasets by distributing work across clusters. The key insight: break big problems into smaller pieces, process them in parallel, then combine results.

Apache Hadoop

  • MapReduce programming model splits data processing into map (transform) and reduce (aggregate) phases, enabling parallel computation across commodity hardware (see the word-count sketch after this list)
  • HDFS (Hadoop Distributed File System) stores data in 128MB blocks replicated across multiple nodes, providing fault tolerance through redundancy
  • Batch-oriented architecture makes it ideal for ETL pipelines and historical analysis, though not suitable for low-latency requirements
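
To make the two phases concrete, here is a minimal word-count sketch in plain Python that mimics map, shuffle, and reduce locally. It is a conceptual illustration only, not the Hadoop API; a real job would run as a Java MapReduce program or through Hadoop Streaming.

```python
from collections import defaultdict

# Map phase: transform each input record into (key, value) pairs
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group values by key (Hadoop does this between the phases)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the grouped values for each key
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big clusters", "big frameworks split big problems"]
print(reduce_phase(shuffle(map_phase(documents))))
# {'big': 4, 'data': 1, 'needs': 1, 'clusters': 1, 'frameworks': 1, 'split': 1, 'problems': 1}
```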

Apache Hive

  • SQL-like interface (HiveQL) translates familiar queries into MapReduce jobs, lowering the barrier for analysts without programming backgrounds (a query sketch follows this list)
  • Data warehousing layer sits on top of Hadoop, enabling schema-on-read for flexible data exploration
  • Partitioning and bucketing optimize query performance by reducing the amount of data scanned for filtered queries
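
A minimal sketch of running a partition-pruned HiveQL query from Python, assuming a HiveServer2 endpoint on localhost:10000 and the third-party pyhive package; the sales table and its dt partition column are hypothetical.

```python
from pyhive import hive  # pip install pyhive

# Connect to a HiveServer2 endpoint (host and port are assumptions)
cursor = hive.connect(host="localhost", port=10000).cursor()

# Filtering on the partition column lets Hive prune partitions,
# so the query scans only one day of data instead of the whole table
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales
    WHERE dt = '2024-01-15'
    GROUP BY region
""")
for region, orders in cursor.fetchall():
    print(region, orders)
```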

Compare: Hadoop vs. Hive. Both process batch data on HDFS, but Hadoop requires MapReduce programming while Hive provides SQL abstraction. If asked about enabling business analysts to query big data, Hive is your answer.


In-Memory and Hybrid Processing

These frameworks dramatically improved processing speed by keeping data in memory rather than writing to disk between operations. The mechanism: RAM access is orders of magnitude faster than disk I/O.

Apache Spark

  • In-memory computation through Resilient Distributed Datasets (RDDs) delivers up to 100x faster processing than disk-based MapReduce for iterative algorithms
  • Unified engine with built-in libraries for Spark SQL, MLlib, GraphX, and Spark Streaming eliminates the need for multiple specialized tools
  • Multi-language support (Python, Scala, Java, R) and the DataFrame API make it the most accessible framework for data scientists (see the PySpark sketch below)
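
A minimal PySpark sketch of the DataFrame API, assuming a local pyspark installation; the data and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local session; on a cluster this would point at a master URL
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# A tiny DataFrame; in practice this would be read from HDFS, S3, etc.
df = spark.createDataFrame(
    [("web", 120.0), ("mobile", 80.0), ("web", 45.5)],
    ["channel", "revenue"],
)

# Transformations are lazy, and intermediate results stay in memory
# across stages rather than being written to disk between operations
df.groupBy("channel").agg(F.sum("revenue").alias("total")).show()

spark.stop()
```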

Apache Flink

  • True stream processing treats batch as a special case of streaming, using event-time semantics rather than processing time for accurate results (illustrated by the windowing sketch after this list)
  • Stateful computations maintain context across events, enabling complex patterns like sessionization and windowed aggregations
  • Exactly-once processing guarantees ensure data integrity even during failures, critical for financial and transactional applications
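
PyFlink jobs carry a fair amount of setup, so here is a framework-free Python sketch of the core idea behind event-time windowing: events are grouped by the timestamp they carry, not by when they happen to arrive. Conceptual only, not the Flink API.

```python
from collections import defaultdict

# Each event carries its own event-time timestamp (seconds)
events = [
    {"user": "a", "ts": 3, "clicks": 1},
    {"user": "b", "ts": 12, "clicks": 5},
    {"user": "a", "ts": 7, "clicks": 2},  # arrives late but still lands in window [0, 10)
]

WINDOW = 10  # tumbling window size in seconds

# Assign each event to a window by event time, then aggregate per (user, window)
totals = defaultdict(int)
for e in events:
    window_start = (e["ts"] // WINDOW) * WINDOW
    totals[(e["user"], window_start)] += e["clicks"]

for (user, start), clicks in sorted(totals.items()):
    print(f"user={user} window=[{start},{start + WINDOW}) clicks={clicks}")
```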

Compare: Spark vs. Flink. Both handle batch and streaming, but Spark's streaming is micro-batch (small batches processed quickly) while Flink is native streaming (event-by-event). For sub-second latency requirements, Flink is the stronger choice.


Real-Time Stream Processing

When milliseconds matter, these frameworks provide continuous processing of data as it arrives. The principle: process data in motion rather than at rest.

Apache Storm

  • Topology-based architecture defines data flows through spouts (data sources) and bolts (processing nodes) for continuous, unbounded streams (sketched in Python after this list)
  • At-least-once processing guarantees with sub-second latency, making it suitable for real-time alerting and monitoring
  • Fault tolerance through task reassignment: if a worker process dies, its supervisor restarts it, and if a whole node fails, Nimbus reassigns its tasks to other machines in the cluster
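
Storm topologies are normally written in Java; the following is a framework-free Python analogy of the spout-to-bolt flow, not the Storm API.

```python
import random

# "Spout": an unbounded data source, modeled here as a finite generator
def sensor_spout(n):
    for _ in range(n):
        yield {"sensor": random.choice(["s1", "s2"]), "temp": random.uniform(15.0, 45.0)}

# "Bolt" 1: a processing node that filters tuples
def filter_bolt(stream, threshold=40.0):
    for tup in stream:
        if tup["temp"] > threshold:
            yield tup

# "Bolt" 2: a downstream node that acts on the filtered stream
def alert_bolt(stream):
    for tup in stream:
        print(f"ALERT: {tup['sensor']} reads {tup['temp']:.1f}")

# Wire the topology: spout -> filter bolt -> alert bolt
alert_bolt(filter_bolt(sensor_spout(100)))
```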

Apache Kafka

  • Distributed commit log stores streams of records in fault-tolerant, ordered partitions, serving as the central nervous system for data pipelines
  • Publish-subscribe messaging with consumer groups allows multiple applications to read the same data independently at their own pace (a producer/consumer sketch follows this list)
  • High throughput (millions of messages per second) with configurable retention enables both real-time processing and historical replay
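
A minimal producer/consumer sketch, assuming the third-party kafka-python package and a broker at localhost:9092; the events topic and analytics group name are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: append records to the "events" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "click"}')
producer.flush()

# Consumer: each consumer group tracks its own offsets, so a second
# group could replay the same retained history independently
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",  # start from the oldest retained record
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```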

Compare: Storm vs. Kafka. Storm processes streams while Kafka transports and stores them. They're often used together: Kafka as the message backbone, Storm (or Flink/Spark) as the processing engine.


Distributed Storage Systems

These NoSQL databases solve the problem of storing and retrieving massive datasets with low latency. The trade-off: they sacrifice some relational database features (joins, ACID transactions) for horizontal scalability.

Apache HBase

  • Column-family storage model on top of HDFS provides random, real-time read/write access to billions of rows with millisecond latency (see the access sketch after this list)
  • Sparse data optimization: only non-null values are stored, making it efficient for wide tables where most cells are empty
  • Strong consistency within rows and tight Hadoop integration make it ideal for analytics workloads requiring both batch and real-time access
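
A row-level access sketch using the third-party happybase client, which talks to HBase through its Thrift gateway (assumed on localhost); the table, row key, and column family are hypothetical.

```python
import happybase  # pip install happybase

connection = happybase.Connection("localhost")
table = connection.table("metrics")  # assumes a table with column family "d"

# Cells are addressed as b"family:qualifier" -> bytes; null cells are
# simply absent, which is what makes sparse wide tables cheap to store
table.put(b"sensor-1#2024-01-15", {b"d:temp": b"21.5", b"d:unit": b"C"})

row = table.row(b"sensor-1#2024-01-15")  # random read by row key
print(row[b"d:temp"])  # b'21.5'
```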

Apache Cassandra

  • Peer-to-peer architecture with no master node eliminates single points of failure, achieving high availability through decentralization
  • Tunable consistency lets you balance strong consistency against availability per query, following the CAP theorem trade-offs (shown in the sketch after this list)
  • Multi-data center replication built into the core design supports global deployments with local read/write performance
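
A sketch of per-query tunable consistency with the DataStax cassandra-driver package; the contact point, keyspace, and table are assumptions.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster  # pip install cassandra-driver
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# Consistency is chosen per statement: QUORUM waits for a majority of
# replicas (stronger reads), while ONE favors availability and latency
read = SimpleStatement(
    "SELECT total FROM orders WHERE order_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(read, ("o-1001",)):
    print(row.total)
```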

MongoDB

  • Document-oriented storage uses flexible JSON-like documents (BSON), enabling schema evolution without migrations
  • Rich query language with secondary indexes, aggregation pipelines, and geospatial queries supports complex application requirements (a pipeline sketch follows this list)
  • Horizontal scaling through sharding distributes documents across clusters based on a shard key you define
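
A pymongo sketch of flexible documents, a secondary index, and an aggregation pipeline, assuming a local mongod; the collection and fields are illustrative.

```python
from pymongo import MongoClient  # pip install pymongo

db = MongoClient("mongodb://localhost:27017")["demo"]

# Documents in one collection can have different shapes; no migration needed
db.orders.insert_many([
    {"user": "a", "total": 30, "items": ["pen"]},
    {"user": "b", "total": 75, "items": ["book", "lamp"], "coupon": "VIP"},
])

# Secondary index to speed up queries that filter on "user"
db.orders.create_index("user")

# Aggregation pipeline: filter, then group and sum, all server-side
pipeline = [
    {"$match": {"total": {"$gt": 20}}},
    {"$group": {"_id": "$user", "spend": {"$sum": "$total"}}},
]
for doc in db.orders.aggregate(pipeline):
    print(doc)
```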

Compare: HBase vs. Cassandra. Both use wide-column (column-family) data models and scale horizontally, but HBase requires HDFS and provides strong consistency, while Cassandra is standalone with tunable consistency. Choose HBase for Hadoop ecosystems, Cassandra for always-available global applications.


Search and Analytics Engines

When you need to search through massive text datasets or perform real-time aggregations, specialized engines outperform general-purpose databases. The mechanism: inverted indexes map terms to documents for sub-second full-text search.
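
A tiny pure-Python sketch of that mechanism: map each term to the set of documents containing it, so a query touches only the relevant postings lists instead of scanning every document.

```python
from collections import defaultdict

docs = {
    1: "spark keeps data in memory",
    2: "kafka stores streams on disk",
    3: "spark streaming reads from kafka",
}

# Build the inverted index: term -> set of document ids (the "postings")
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A query intersects postings lists rather than scanning all documents
def search(*terms):
    return sorted(set.intersection(*(index[t] for t in terms)))

print(search("spark"))           # [1, 3]
print(search("spark", "kafka"))  # [3]
```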

Elasticsearch

  • Distributed search engine built on Apache Lucene provides near-real-time indexing and retrieval across petabytes of text data
  • RESTful API with JSON makes it accessible from any programming language, with a powerful query DSL for full-text search, filtering, and fuzzy matching (see the query sketch after this list)
  • Aggregations framework enables real-time analytics dashboards, often paired with Kibana for visualization in the "ELK stack"
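
A sketch against the official elasticsearch Python client, using 8.x-style keyword arguments and assuming a node at localhost:9200; the articles index and its fields are hypothetical.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document; it becomes searchable in near real time
es.index(index="articles", id="1", document={"title": "Intro to Spark", "views": 120})

# One request combines a full-text match query with a real-time aggregation
resp = es.search(
    index="articles",
    query={"match": {"title": "spark"}},
    aggs={"avg_views": {"avg": {"field": "views"}}},
)
print(resp["hits"]["total"], resp["aggregations"]["avg_views"])
```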

Compare: MongoDB vs. Elasticsearch. Both store JSON documents, but MongoDB is optimized for CRUD operations and application data, while Elasticsearch excels at search and analytics. Many architectures use MongoDB as the primary store and sync to Elasticsearch for search.


Quick Reference Table

Concept                 | Best Examples
------------------------|-----------------------------
Batch Processing        | Hadoop, Hive, Spark
Stream Processing       | Flink, Storm, Kafka Streams
In-Memory Speed         | Spark, Flink
Message Transport       | Kafka
SQL on Big Data         | Hive, Spark SQL
Column-Family Storage   | HBase, Cassandra
Document Storage        | MongoDB, Elasticsearch
Full-Text Search        | Elasticsearch

Self-Check Questions

  1. Which two frameworks both support batch and stream processing, and what architectural difference determines when you'd choose one over the other?

  2. If you need to enable business analysts to query petabytes of historical data using SQL without writing code, which framework would you recommend and why?

  3. Compare HBase and Cassandra: what underlying infrastructure requirement differs between them, and how does this affect their consistency guarantees?

  4. A financial services company needs exactly-once processing guarantees for transaction streams with sub-second latency. Which framework best fits this requirement, and what feature enables this guarantee?

  5. Explain why Kafka is often called the "central nervous system" of big data architectures: what role does it play that differs from processing frameworks like Spark or Storm?