📊 Big Data Analytics and Visualization

Major Big Data Frameworks


Why This Matters

When you're working with big data analytics and visualization, the framework you choose determines everything: how fast you can process data, whether you can handle real-time streams, and what kind of insights you can extract. You're being tested on understanding why different frameworks exist, when to use each one, and how they solve fundamentally different problems. The key concepts here include batch vs. stream processing, storage architecture, fault tolerance, and query optimization.

Don't fall into the trap of memorizing framework names and features in isolation. Instead, focus on the underlying principles: What problem does each framework solve? How does its architecture enable that solution? When would you choose one over another? These comparative questions are exactly what you'll face on exams and in real-world data engineering decisions.


Batch Processing Foundations

These frameworks established the core paradigms for processing massive datasets by distributing work across clusters. The key insight: break big problems into smaller pieces, process them in parallel, then combine results.

Apache Hadoop

  • MapReduce programming model splits data processing into map (transform) and reduce (aggregate) phases, enabling parallel computation across commodity hardware (see the word-count sketch after this list)
  • HDFS (Hadoop Distributed File System) stores data in 128MB blocks replicated across multiple nodes, providing fault tolerance through redundancy
  • Batch-oriented architecture makes it ideal for ETL pipelines and historical analysis, though not suitable for low-latency requirements
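
To make the two phases concrete, here is a minimal word-count sketch in plain Python that mimics map, shuffle, and reduce locally. It is a conceptual illustration only, not the Hadoop API; a real job would run as a Java MapReduce program or through Hadoop Streaming.

```python
from collections import defaultdict

# Map phase: transform each input record into (key, value) pairs
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle: group values by key (Hadoop does this between the phases)
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: aggregate the grouped values for each key
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data needs big clusters", "big frameworks split big problems"]
print(reduce_phase(shuffle(map_phase(documents))))
# {'big': 4, 'data': 1, 'needs': 1, 'clusters': 1, 'frameworks': 1, 'split': 1, 'problems': 1}
```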

Apache Hive

  • SQL-like interface (HiveQL) translates familiar queries into MapReduce jobs, lowering the barrier for analysts without programming backgrounds (a query sketch follows this list)
  • Data warehousing layer sits on top of Hadoop, enabling schema-on-read for flexible data exploration
  • Partitioning and bucketing optimize query performance by reducing the amount of data scanned for filtered queries
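
A minimal sketch of running a partition-pruned HiveQL query from Python, assuming a HiveServer2 endpoint on localhost:10000 and the third-party pyhive package; the sales table and its dt partition column are hypothetical.

```python
from pyhive import hive  # pip install pyhive

# Connect to a HiveServer2 endpoint (host and port are assumptions)
cursor = hive.connect(host="localhost", port=10000).cursor()

# Filtering on the partition column lets Hive prune partitions,
# so the query scans only one day of data instead of the whole table
cursor.execute("""
    SELECT region, COUNT(*) AS orders
    FROM sales
    WHERE dt = '2024-01-15'
    GROUP BY region
""")
for region, orders in cursor.fetchall():
    print(region, orders)
```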

Compare: Hadoop vs. Hive. Both process batch data on HDFS, but Hadoop requires MapReduce programming while Hive provides SQL abstraction. If asked about enabling business analysts to query big data, Hive is your answer.


In-Memory and Hybrid Processing

These frameworks dramatically improved processing speed by keeping data in memory rather than writing to disk between operations. The mechanism: RAM access is orders of magnitude faster than disk I/O.

Apache Spark

  • In-memory computation through Resilient Distributed Datasets (RDDs) delivers up to 100x faster processing than disk-based MapReduce for iterative algorithms
  • Unified engine with built-in libraries for Spark SQL, MLlib, GraphX, and Spark Streaming eliminates the need for multiple specialized tools
  • Multi-language support (Python, Scala, Java, R) and the DataFrame API make it the most accessible framework for data scientists (see the PySpark sketch below)
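
A minimal PySpark sketch of the DataFrame API, assuming a local pyspark installation; the data and column names are illustrative.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local session; on a cluster this would point at a master URL
spark = SparkSession.builder.appName("demo").master("local[*]").getOrCreate()

# A tiny DataFrame; in practice this would be read from HDFS, S3, etc.
df = spark.createDataFrame(
    [("web", 120.0), ("mobile", 80.0), ("web", 45.5)],
    ["channel", "revenue"],
)

# Transformations are lazy, and intermediate results stay in memory
# across stages rather than being written to disk between operations
df.groupBy("channel").agg(F.sum("revenue").alias("total")).show()

spark.stop()
```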

Apache Flink

  • True stream processing treats batch as a special case of streaming, using event-time semantics rather than processing time for accurate results (illustrated by the windowing sketch after this list)
  • Stateful computations maintain context across events, enabling complex patterns like sessionization and windowed aggregations
  • Exactly-once processing guarantees ensure data integrity even during failures, critical for financial and transactional applications
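
PyFlink jobs carry a fair amount of setup, so here is a framework-free Python sketch of the core idea behind event-time windowing: events are grouped by the timestamp they carry, not by when they happen to arrive. Conceptual only, not the Flink API.

```python
from collections import defaultdict

# Each event carries its own event-time timestamp (seconds)
events = [
    {"user": "a", "ts": 3, "clicks": 1},
    {"user": "b", "ts": 12, "clicks": 5},
    {"user": "a", "ts": 7, "clicks": 2},  # arrives late but still lands in window [0, 10)
]

WINDOW = 10  # tumbling window size in seconds

# Assign each event to a window by event time, then aggregate per (user, window)
totals = defaultdict(int)
for e in events:
    window_start = (e["ts"] // WINDOW) * WINDOW
    totals[(e["user"], window_start)] += e["clicks"]

for (user, start), clicks in sorted(totals.items()):
    print(f"user={user} window=[{start},{start + WINDOW}) clicks={clicks}")
```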

Compare: Spark vs. Flink. Both handle batch and streaming, but Spark's streaming is micro-batch (small batches processed quickly) while Flink is native streaming (event-by-event). For sub-second latency requirements, Flink is the stronger choice.


Real-Time Stream Processing

When milliseconds matter, these frameworks provide continuous processing of data as it arrives. The principle: process data in motion rather than at rest.

Apache Storm

  • Topology-based architecture defines data flows through spouts (data sources) and bolts (processing nodes) for continuous, unbounded streams (sketched in Python after this list)
  • At-least-once processing guarantees with sub-second latency, making it suitable for real-time alerting and monitoring
  • Fault tolerance through task reassignment: if a worker process dies, its supervisor restarts it, and if a whole node fails, Nimbus reassigns its tasks to other machines in the cluster
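
Storm topologies are normally written in Java; the following is a framework-free Python analogy of the spout-to-bolt flow, not the Storm API.

```python
import random

# "Spout": an unbounded data source, modeled here as a finite generator
def sensor_spout(n):
    for _ in range(n):
        yield {"sensor": random.choice(["s1", "s2"]), "temp": random.uniform(15.0, 45.0)}

# "Bolt" 1: a processing node that filters tuples
def filter_bolt(stream, threshold=40.0):
    for tup in stream:
        if tup["temp"] > threshold:
            yield tup

# "Bolt" 2: a downstream node that acts on the filtered stream
def alert_bolt(stream):
    for tup in stream:
        print(f"ALERT: {tup['sensor']} reads {tup['temp']:.1f}")

# Wire the topology: spout -> filter bolt -> alert bolt
alert_bolt(filter_bolt(sensor_spout(100)))
```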

Apache Kafka

  • Distributed commit log stores streams of records in fault-tolerant, ordered partitions, serving as the central nervous system for data pipelines
  • Publish-subscribe messaging with consumer groups allows multiple applications to read the same data independently at their own pace (a producer/consumer sketch follows this list)
  • High throughput (millions of messages per second) with configurable retention enables both real-time processing and historical replay
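
A minimal producer/consumer sketch, assuming the third-party kafka-python package and a broker at localhost:9092; the events topic and analytics group name are hypothetical.

```python
from kafka import KafkaProducer, KafkaConsumer  # pip install kafka-python

# Producer: append records to the "events" topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("events", key=b"user-42", value=b'{"action": "click"}')
producer.flush()

# Consumer: each consumer group tracks its own offsets, so a second
# group could replay the same retained history independently
consumer = KafkaConsumer(
    "events",
    bootstrap_servers="localhost:9092",
    group_id="analytics",
    auto_offset_reset="earliest",  # start from the oldest retained record
    consumer_timeout_ms=5000,      # stop iterating after 5s of silence
)
for record in consumer:
    print(record.partition, record.offset, record.value)
```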

Compare: Storm vs. Kafka. Storm processes streams while Kafka transports and stores them. They're often used together: Kafka as the message backbone, Storm (or Flink/Spark) as the processing engine.


Distributed Storage Systems

These NoSQL databases solve the problem of storing and retrieving massive datasets with low latency. The trade-off: they sacrifice some relational database features (joins, ACID transactions) for horizontal scalability.

Apache HBase

  • Column-family storage model on top of HDFS provides random, real-time read/write access to billions of rows with millisecond latency (see the access sketch after this list)
  • Sparse data optimization: only non-null values are stored, making it efficient for wide tables where most cells are empty
  • Strong consistency within rows and tight Hadoop integration make it ideal for analytics workloads requiring both batch and real-time access
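
A row-level access sketch using the third-party happybase client, which talks to HBase through its Thrift gateway (assumed on localhost); the table, row key, and column family are hypothetical.

```python
import happybase  # pip install happybase

connection = happybase.Connection("localhost")
table = connection.table("metrics")  # assumes a table with column family "d"

# Cells are addressed as b"family:qualifier" -> bytes; null cells are
# simply absent, which is what makes sparse wide tables cheap to store
table.put(b"sensor-1#2024-01-15", {b"d:temp": b"21.5", b"d:unit": b"C"})

row = table.row(b"sensor-1#2024-01-15")  # random read by row key
print(row[b"d:temp"])  # b'21.5'
```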

Apache Cassandra

  • Peer-to-peer architecture with no master node eliminates single points of failure, achieving high availability through decentralization
  • Tunable consistency lets you balance strong consistency against availability per query, following the CAP theorem trade-offs (shown in the sketch after this list)
  • Multi-data center replication built into the core design supports global deployments with local read/write performance
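
A sketch of per-query tunable consistency with the DataStax cassandra-driver package; the contact point, keyspace, and table are assumptions.

```python
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster  # pip install cassandra-driver
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")  # hypothetical keyspace

# Consistency is chosen per statement: QUORUM waits for a majority of
# replicas (stronger reads), while ONE favors availability and latency
read = SimpleStatement(
    "SELECT total FROM orders WHERE order_id = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
for row in session.execute(read, ("o-1001",)):
    print(row.total)
```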

MongoDB

  • Document-oriented storage uses flexible JSON-like documents (BSON), enabling schema evolution without migrations
  • Rich query language with secondary indexes, aggregation pipelines, and geospatial queries supports complex application requirements (a pipeline sketch follows this list)
  • Horizontal scaling through sharding distributes documents across clusters based on a shard key you define
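
A pymongo sketch of flexible documents, a secondary index, and an aggregation pipeline, assuming a local mongod; the collection and fields are illustrative.

```python
from pymongo import MongoClient  # pip install pymongo

db = MongoClient("mongodb://localhost:27017")["demo"]

# Documents in one collection can have different shapes; no migration needed
db.orders.insert_many([
    {"user": "a", "total": 30, "items": ["pen"]},
    {"user": "b", "total": 75, "items": ["book", "lamp"], "coupon": "VIP"},
])

# Secondary index to speed up queries that filter on "user"
db.orders.create_index("user")

# Aggregation pipeline: filter, then group and sum, all server-side
pipeline = [
    {"$match": {"total": {"$gt": 20}}},
    {"$group": {"_id": "$user", "spend": {"$sum": "$total"}}},
]
for doc in db.orders.aggregate(pipeline):
    print(doc)
```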

Compare: HBase vs. Cassandra. Both use wide-column (column-family) data models and scale horizontally, but HBase requires HDFS and provides strong consistency, while Cassandra is standalone with tunable consistency. Choose HBase for Hadoop ecosystems, Cassandra for always-available global applications.


Search and Analytics Engines

When you need to search through massive text datasets or perform real-time aggregations, specialized engines outperform general-purpose databases. The mechanism: inverted indexes map terms to documents for sub-second full-text search.
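
A tiny pure-Python sketch of that mechanism: map each term to the set of documents containing it, so a query touches only the relevant postings lists instead of scanning every document.

```python
from collections import defaultdict

docs = {
    1: "spark keeps data in memory",
    2: "kafka stores streams on disk",
    3: "spark streaming reads from kafka",
}

# Build the inverted index: term -> set of document ids (the "postings")
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().split():
        index[term].add(doc_id)

# A query intersects postings lists rather than scanning all documents
def search(*terms):
    return sorted(set.intersection(*(index[t] for t in terms)))

print(search("spark"))           # [1, 3]
print(search("spark", "kafka"))  # [3]
```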

Elasticsearch

  • Distributed search engine built on Apache Lucene provides near-real-time indexing and retrieval across petabytes of text data
  • RESTful API with JSON makes it accessible from any programming language, with a powerful query DSL for full-text search, filtering, and fuzzy matching (see the query sketch after this list)
  • Aggregations framework enables real-time analytics dashboards, often paired with Kibana for visualization in the "ELK stack"
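
A sketch against the official elasticsearch Python client, using 8.x-style keyword arguments and assuming a node at localhost:9200; the articles index and its fields are hypothetical.

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a document; it becomes searchable in near real time
es.index(index="articles", id="1", document={"title": "Intro to Spark", "views": 120})

# One request combines a full-text match query with a real-time aggregation
resp = es.search(
    index="articles",
    query={"match": {"title": "spark"}},
    aggs={"avg_views": {"avg": {"field": "views"}}},
)
print(resp["hits"]["total"], resp["aggregations"]["avg_views"])
```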

Compare: MongoDB vs. Elasticsearch. Both store JSON documents, but MongoDB is optimized for CRUD operations and application data, while Elasticsearch excels at search and analytics. Many architectures use MongoDB as the primary store and sync to Elasticsearch for search.


Quick Reference Table

Concept                 | Best Examples
------------------------|-----------------------------
Batch Processing        | Hadoop, Hive, Spark
Stream Processing       | Flink, Storm, Kafka Streams
In-Memory Speed         | Spark, Flink
Message Transport       | Kafka
SQL on Big Data         | Hive, Spark SQL
Column-Family Storage   | HBase, Cassandra
Document Storage        | MongoDB, Elasticsearch
Full-Text Search        | Elasticsearch

Self-Check Questions

  1. Which two frameworks both support batch and stream processing, and what architectural difference determines when you'd choose one over the other?

  2. If you need to enable business analysts to query petabytes of historical data using SQL without writing code, which framework would you recommend and why?

  3. Compare HBase and Cassandra: what underlying infrastructure requirement differs between them, and how does this affect their consistency guarantees?

  4. A financial services company needs exactly-once processing guarantees for transaction streams with sub-second latency. Which framework best fits this requirement, and what feature enables this guarantee?

  5. Explain why Kafka is often called the "central nervous system" of big data architectures: what role does it play that differs from processing frameworks like Spark or Storm?