⛱️Cognitive Computing in Business

Key Concepts in Big Data Technologies


Why This Matters

In cognitive computing and business analytics, your ability to choose the right technology for a specific data challenge separates strategic thinkers from those who just memorize tool names. You're being tested on understanding why certain technologies excel at batch processing while others dominate real-time streaming, and how these architectural differences translate into business value. The technologies in this guide form the backbone of modern data infrastructure—from customer analytics to fraud detection to recommendation engines.

Don't just memorize what each tool does. Instead, focus on the underlying principles: distributed computing, stream vs. batch processing, data modeling flexibility, and the speed-scale tradeoff. When an exam question asks which technology fits a scenario, you need to match the business requirement to the architectural strength. That's the skill that matters.


Distributed Storage and Batch Processing

These foundational technologies solve the problem of storing and processing datasets too large for any single machine. The core principle is horizontal scaling—distributing work across clusters of commodity hardware rather than relying on expensive supercomputers.

Hadoop

  • HDFS (Hadoop Distributed File System) provides fault-tolerant storage by replicating data blocks across multiple nodes—if one server fails, your data survives
  • MapReduce paradigm breaks massive computations into parallel tasks, making it ideal for batch processing jobs like historical data analysis
  • Cost-effective scalability using commodity hardware made big data accessible to organizations beyond tech giants, though it's slower than in-memory alternatives
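
To make the MapReduce split concrete, here is a minimal word-count sketch written for Hadoop Streaming, which lets you supply the map and reduce steps as ordinary Python scripts; the file names (mapper.py, reducer.py) are illustrative, not part of any particular cluster setup.

    # mapper.py: read raw text from stdin, emit one "word<TAB>1" pair per word
    import sys

    for line in sys.stdin:
        for word in line.strip().split():
            print(f"{word}\t1")

    # reducer.py: Hadoop sorts mapper output by key, so all counts for a word
    # arrive consecutively and can be folded into a running total
    import sys

    current_word, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t", 1)
        if word != current_word:
            if current_word is not None:
                print(f"{current_word}\t{count}")
            current_word, count = word, 0
        count += int(value)
    if current_word is not None:
        print(f"{current_word}\t{count}")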

Apache Hive

  • SQL-like interface (HiveQL) enables analysts familiar with traditional databases to query massive datasets without learning Java or Scala
  • Data warehousing layer sits on top of Hadoop, translating queries into MapReduce jobs—perfect for business intelligence and reporting use cases
  • Schema-on-read flexibility allows you to impose structure at query time rather than ingestion, accommodating evolving data formats
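
As a sketch of the analyst workflow, the snippet below issues a HiveQL query from Python through the PyHive client; the HiveServer2 host, credentials, and the sales table are assumptions for illustration.

    # Hypothetical ad-hoc HiveQL query routed through HiveServer2 via PyHive
    from pyhive import hive

    conn = hive.Connection(host="hive-host.example.com", port=10000, username="analyst")
    cursor = conn.cursor()
    # HiveQL reads like ordinary SQL; Hive compiles it into batch jobs on the cluster
    cursor.execute("""
        SELECT region, SUM(revenue) AS total_revenue
        FROM sales
        WHERE sale_date >= '2024-01-01'
        GROUP BY region
        ORDER BY total_revenue DESC
    """)
    for region, total in cursor.fetchall():
        print(region, total)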

Compare: Hadoop vs. Hive—both operate on the same underlying infrastructure, but Hadoop requires programming expertise while Hive democratizes access through SQL. If an exam scenario involves business analysts needing ad-hoc queries on historical data, Hive is your answer.


Real-Time Stream Processing

When businesses need insights in milliseconds rather than hours, stream processing engines deliver. These technologies process data as it arrives, enabling immediate reactions to events rather than retrospective analysis.

Apache Kafka

  • Publish-subscribe messaging allows multiple applications to consume the same data stream simultaneously—critical for event-driven architectures
  • High throughput with durability handles millions of messages per second while persisting data to disk, making it the backbone of real-time data pipelines
  • Decoupling producers and consumers means your data ingestion and processing systems can scale independently, a key architectural advantage
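
A minimal publish-subscribe sketch using the kafka-python client; the broker address, topic name, and consumer group are illustrative assumptions.

    # Producer side: publish click events to a topic
    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="broker.example.com:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("clickstream", {"user_id": 42, "page": "/checkout"})
    producer.flush()

    # Consumer side: independent consumer groups can each read the same stream
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "clickstream",
        bootstrap_servers="broker.example.com:9092",
        group_id="fraud-detection",
        value_deserializer=lambda b: json.loads(b.decode("utf-8")),
    )
    for message in consumer:
        print(message.value)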

Apache Spark

  • In-memory processing delivers speeds up to 100x faster than Hadoop MapReduce for iterative algorithms like machine learning training
  • Unified engine handles batch, streaming (Spark Streaming), SQL, and ML workloads in one framework—reducing operational complexity
  • DataFrame API provides familiar tabular operations across Java, Scala, Python, and R, lowering the barrier for data scientists
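
A minimal PySpark sketch of the DataFrame API; the HDFS path and column names are assumptions chosen to mirror the retail scenario in the self-check questions.

    # Load a (hypothetical) transactions file and aggregate it in memory with PySpark
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("seasonal-patterns").getOrCreate()

    df = spark.read.csv("hdfs:///data/transactions.csv", header=True, inferSchema=True)
    (df.groupBy(F.month("purchase_date").alias("month"))
       .agg(F.sum("amount").alias("total_sales"))
       .orderBy("month")
       .show())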

Apache Flink

  • True stream-first architecture processes events one at a time with millisecond latency, unlike Spark's micro-batch approach
  • Event time processing handles out-of-order data correctly—essential when network delays scramble message arrival times
  • Stateful computations maintain context across events, enabling complex patterns like sessionization and fraud detection windows
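
A minimal PyFlink sketch of a keyed, stateful computation; in practice the bounded collection below would be replaced by an unbounded source such as a Kafka connector, and the event fields are assumed.

    from pyflink.datastream import StreamExecutionEnvironment

    env = StreamExecutionEnvironment.get_execution_environment()

    # Toy bounded source standing in for an unbounded stream of (user_id, amount) events
    events = env.from_collection([("u1", 20.0), ("u2", 5.0), ("u1", 35.0)])

    # key_by partitions the stream; reduce keeps per-key running state across events
    running_totals = (
        events
        .key_by(lambda e: e[0])
        .reduce(lambda a, b: (a[0], a[1] + b[1]))
    )
    running_totals.print()
    env.execute("per-user-running-totals")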

Apache Storm

  • Guaranteed message processing ensures every tuple is fully processed or explicitly failed, critical for mission-critical applications
  • Topology-based architecture lets you design complex processing graphs with spouts (data sources) and bolts (processing units)
  • Sub-second latency makes it suitable for real-time monitoring, alerting, and operational dashboards
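
Storm itself runs on the JVM, so the sketch below deliberately does not use the Storm API; it just illustrates the spout-to-bolt flow (a source feeding a chain of small processing units) in plain Python.

    # Plain-Python illustration of Storm's spout -> bolt topology idea (not the Storm API)
    def spout():
        """Data source: emits raw events one at a time."""
        for event in [{"card": "1234", "amount": 25.0}, {"card": "1234", "amount": 9800.0}]:
            yield event

    def filter_bolt(stream):
        """Processing unit: pass along only suspicious events."""
        for event in stream:
            if event["amount"] > 5000:
                yield event

    def alert_bolt(stream):
        """Terminal unit: act on each tuple as it arrives."""
        for event in stream:
            print("ALERT: possible fraud", event)

    # Wiring spout -> filter -> alert mirrors a Storm topology graph
    alert_bolt(filter_bolt(spout()))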

Compare: Spark vs. Flink—both handle streaming, but Spark uses micro-batches (small batch jobs in rapid succession) while Flink processes true streams. For sub-second latency requirements like fraud detection, Flink wins; for unified batch-and-stream workflows, Spark's ecosystem is more mature.


Flexible Data Storage

Traditional relational databases struggle with unstructured data and horizontal scaling. NoSQL databases relax some of the guarantees relational systems provide (such as strict ACID transactions) in exchange for flexibility, scalability, and performance at massive scale.

NoSQL Databases (MongoDB, Cassandra)

  • Schema flexibility allows documents with different structures to coexist—ideal for rapidly evolving applications where data models change frequently
  • Horizontal scalability distributes data across nodes automatically; adding servers increases capacity linearly without downtime
  • Multiple data models (document, key-value, column-family, graph) let you match storage architecture to your specific access patterns
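
A minimal PyMongo sketch of schema flexibility: two documents with different shapes land in the same collection. The connection string, database, and field names are illustrative.

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    products = client["shop"]["products"]

    # Documents in the same collection do not need to share a schema
    products.insert_one({"name": "Laptop", "price": 999, "specs": {"ram_gb": 16}})
    products.insert_one({"name": "Gift Card", "price": 25})  # no "specs" field at all

    for doc in products.find({"price": {"$lt": 100}}):
        print(doc)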

Elasticsearch

  • Inverted index architecture enables sub-second full-text search across billions of documents—the technology behind many of the search boxes you use every day
  • Real-time analytics aggregates and visualizes log data, metrics, and events as they arrive, powering observability platforms
  • Distributed by design automatically shards and replicates data, providing both performance and fault tolerance
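
A minimal sketch using the official Elasticsearch Python client (8.x-style keyword arguments); the endpoint, index name, and document fields are assumptions.

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")

    # Index a log event; the inverted index makes it searchable almost immediately
    es.index(index="app-logs", document={"level": "ERROR", "message": "payment gateway timeout"})

    # Full-text search over the indexed documents
    results = es.search(index="app-logs", query={"match": {"message": "timeout"}})
    for hit in results["hits"]["hits"]:
        print(hit["_source"])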

Compare: MongoDB vs. Elasticsearch—both handle unstructured data, but MongoDB optimizes for flexible document storage and CRUD operations, while Elasticsearch excels at search and analytics. Choose MongoDB for your application's primary datastore; add Elasticsearch when you need powerful search capabilities.


Machine Learning Infrastructure

Cognitive computing requires specialized frameworks that can train models on massive datasets and deploy them at scale. These tools abstract away the complexity of distributed computation, letting data scientists focus on algorithms rather than infrastructure.

TensorFlow

  • Computational graph architecture optimizes operations across CPUs, GPUs, and TPUs, enabling training of models with billions of parameters
  • End-to-end ecosystem covers everything from data preprocessing (TFX) to model serving (TensorFlow Serving) to mobile deployment (TensorFlow Lite)
  • Deep learning dominance in image recognition, NLP, and recommendation systems makes it essential knowledge for cognitive computing applications
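
A minimal Keras sketch of defining and training a small model; the synthetic data, layer sizes, and training settings are arbitrary choices for illustration.

    import numpy as np
    import tensorflow as tf

    # Synthetic data: 1,000 examples with 20 features and a binary label
    X = np.random.rand(1000, 20).astype("float32")
    y = (X.sum(axis=1) > 10).astype("float32")

    # The model is a computational graph TensorFlow can place on CPU, GPU, or TPU
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(20,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(X, y, epochs=5, batch_size=32)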

Data Visualization and Business Intelligence

Raw data and model outputs only create business value when stakeholders can understand and act on them. Visualization tools bridge the gap between technical analysis and executive decision-making.

Tableau

  • Drag-and-drop interface enables business users to explore data without SQL or programming knowledge—democratizing analytics
  • Live connections to databases, data warehouses, and cloud platforms ensure dashboards reflect current data without manual refreshes
  • Interactive storytelling transforms complex datasets into compelling narratives that drive data-informed decisions across organizations

Compare: Elasticsearch vs. Tableau—Elasticsearch provides real-time search and analytics on raw data, while Tableau creates polished visualizations for business consumption. Technical teams use Elasticsearch for operational monitoring; executives consume Tableau dashboards for strategic decisions.


Quick Reference Table

Concept                      Best Examples
Batch Processing             Hadoop, Hive, Spark (batch mode)
Real-Time Streaming          Kafka, Flink, Storm, Spark Streaming
Data Pipeline/Messaging      Kafka
Flexible Document Storage    MongoDB, NoSQL databases
Search and Log Analytics     Elasticsearch
SQL on Big Data              Hive, Spark SQL
Machine Learning at Scale    TensorFlow, Spark MLlib
Business Visualization       Tableau

Self-Check Questions

  1. A financial services company needs to detect fraudulent transactions within 100 milliseconds of occurrence. Which two technologies would you recommend for the streaming pipeline, and why might you choose Flink over Spark for this use case?

  2. Compare and contrast Hadoop and Spark for a scenario where a retailer wants to analyze five years of transaction history to identify seasonal purchasing patterns. What's the key tradeoff?

  3. Your company's data schema changes frequently as product features evolve. Which storage technology accommodates this flexibility, and what SQL guarantee does it typically sacrifice?

  4. An FRQ describes a company that needs to: (a) ingest clickstream data in real-time, (b) store it for historical analysis, and (c) enable business analysts to query it with SQL. Map each requirement to the most appropriate technology.

  5. Why might an organization use both Elasticsearch and Tableau in their data stack? What distinct business needs does each address?