In cognitive computing and business analytics, your ability to choose the right technology for a specific data challenge separates strategic thinkers from those who just memorize tool names. You're being tested on understanding why certain technologies excel at batch processing while others dominate real-time streaming, and how these architectural differences translate into business value. The technologies in this guide form the backbone of modern data infrastructure—from customer analytics to fraud detection to recommendation engines.
Don't just memorize what each tool does. Instead, focus on the underlying principles: distributed computing, stream vs. batch processing, data modeling flexibility, and the speed-scale tradeoff. When an exam question asks which technology fits a scenario, you need to match the business requirement to the architectural strength. That's the skill that matters.
These foundational technologies solve the problem of storing and processing datasets too large for any single machine. The core principle is horizontal scaling—distributing work across clusters of commodity hardware rather than relying on expensive supercomputers.
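To make the principle concrete, here's a minimal single-machine sketch of the MapReduce pattern that Hadoop popularized: each simulated worker counts words in its own chunk of data (map), and the partial counts are merged into one result (reduce). The chunks, worker pool, and data here are invented for illustration; a real cluster distributes the map phase across machines.

```python
from collections import Counter

# Hypothetical "cluster": each string stands in for a data block
# stored on a different worker node
chunks = ["big data big clusters", "clusters scale out", "big scale"]

# Map phase: each worker counts words in its local chunk independently
partials = [Counter(chunk.split()) for chunk in chunks]

# Reduce phase: merge the partial counts into a global result
totals = sum(partials, Counter())
print(totals.most_common(3))
```

Because each map task touches only local data, adding more commodity machines adds capacity roughly linearly, which is the speed-scale tradeoff the exam expects you to recognize.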
Compare: Hadoop vs. Hive—both operate on the same underlying infrastructure, but Hadoop requires programming expertise while Hive democratizes access through SQL. If an exam scenario involves business analysts needing ad-hoc queries on historical data, Hive is your answer.
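As a sketch of what that SQL access looks like in practice, the snippet below uses the PyHive client to run a HiveQL aggregation. The host, port, and table name are hypothetical, and you'd need a running HiveServer2 for it to execute; the point is that the analyst writes ordinary SQL while Hive compiles it into distributed jobs over data in HDFS.

```python
from pyhive import hive  # assumes the PyHive client is installed

# Hypothetical HiveServer2 endpoint
conn = hive.connect(host="hive-server.example.com", port=10000)
cursor = conn.cursor()

# HiveQL reads like standard SQL; Hive handles the distributed execution
cursor.execute("""
    SELECT region, SUM(amount) AS total_sales
    FROM sales_history            -- hypothetical table of historical transactions
    WHERE year = 2023
    GROUP BY region
""")
for row in cursor.fetchall():
    print(row)
```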
When businesses need insights in milliseconds rather than hours, stream processing engines deliver. These technologies process data as it arrives, enabling immediate reactions to events rather than retrospective analysis.
Compare: Spark vs. Flink—both handle streaming, but Spark uses micro-batches (small batch jobs in rapid succession) while Flink processes true streams. For sub-second latency requirements like fraud detection, Flink wins; for unified batch-and-stream workflows, Spark's ecosystem is more mature.
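The micro-batch model is visible directly in Spark's Structured Streaming API. The sketch below, assuming a local Kafka broker and a hypothetical `transactions` topic, reads a stream and processes it in one-second micro-batches via the `trigger` setting; shrinking that interval is how Spark approximates low latency, whereas Flink processes each event as it arrives.

```python
from pyspark.sql import SparkSession

# Requires the spark-sql-kafka connector package on the classpath
spark = SparkSession.builder.appName("stream-demo").getOrCreate()

# Subscribe to a hypothetical Kafka topic of transaction events
events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "transactions")
    .load())

# Process the stream as a rapid series of small batch jobs
query = (events.selectExpr("CAST(value AS STRING) AS txn")
    .writeStream
    .format("console")
    .trigger(processingTime="1 second")  # the micro-batch interval
    .start())
query.awaitTermination()
```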
Traditional relational databases struggle with unstructured data and horizontal scaling. NoSQL databases relax some of the relational model's guarantees (such as strict ACID transactions) in exchange for schema flexibility, horizontal scalability, and performance at massive scale.
Compare: MongoDB vs. Elasticsearch—both handle unstructured data, but MongoDB optimizes for flexible document storage and CRUD operations, while Elasticsearch excels at search and analytics. Choose MongoDB for your application's primary datastore; add Elasticsearch when you need powerful search capabilities.
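A brief pymongo sketch, assuming a local MongoDB instance and a hypothetical `products` collection, shows that schema flexibility in action: two documents with different shapes live side by side in the same collection, with no migration step.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical local instance
db = client["shop"]

# Documents in one collection can have different fields entirely;
# there is no fixed schema to alter when the product model changes
db.products.insert_one({"name": "T-shirt", "sizes": ["S", "M", "L"]})
db.products.insert_one({"name": "Gift card", "denominations": [25, 50]})

for doc in db.products.find({"name": "T-shirt"}):
    print(doc)
```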
Cognitive computing requires specialized frameworks that can train models on massive datasets and deploy them at scale. These tools abstract away the complexity of distributed computation, letting data scientists focus on algorithms rather than infrastructure.
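For a feel of that abstraction level, here is a minimal Keras sketch in TensorFlow: a few lines define and compile a small binary classifier, while the framework handles differentiation and, with a `tf.distribute` strategy, multi-device training. The layer sizes and the commented-out `fit` call are placeholders, not a recommended architecture.

```python
import tensorflow as tf

# Minimal sketch: a tiny binary classifier. TensorFlow computes gradients
# automatically, and tf.distribute strategies can spread training
# across GPUs or machines without changing the model code.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),            # four hypothetical input features
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(features, labels, epochs=5)    # supply your own training data
```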
Raw data and model outputs only create business value when stakeholders can understand and act on them. Visualization tools bridge the gap between technical analysis and executive decision-making.
Compare: Elasticsearch vs. Tableau—Elasticsearch provides real-time search and analytics on raw data, while Tableau creates polished visualizations for business consumption. Technical teams use Elasticsearch for operational monitoring; executives consume Tableau dashboards for strategic decisions.
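On the Elasticsearch side, "search and analytics on raw data" typically looks like the sketch below, assuming a local cluster, the elasticsearch-py 8.x client, and a hypothetical `app-logs` index: index a log event, then query for errors in near real time.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical local cluster

# Index a raw log event (the index is created on first write)
es.index(index="app-logs",
         document={"level": "ERROR", "msg": "timeout", "service": "checkout"})

# Full-text query for error-level events across the index
resp = es.search(index="app-logs", query={"match": {"level": "ERROR"}})
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```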
| Concept | Best Examples |
|---|---|
| Batch Processing | Hadoop, Hive, Spark (batch mode) |
| Real-Time Streaming | Kafka, Flink, Storm, Spark Streaming |
| Data Pipeline/Messaging | Kafka |
| Flexible Document Storage | MongoDB (document-oriented NoSQL) |
| Search and Log Analytics | Elasticsearch |
| SQL on Big Data | Hive, Spark SQL |
| Machine Learning at Scale | TensorFlow, Spark MLlib |
| Business Visualization | Tableau |
A financial services company needs to detect fraudulent transactions within 100 milliseconds of occurrence. Which two technologies would you recommend for the streaming pipeline, and why might you choose Flink over Spark for this use case?
Compare and contrast Hadoop and Spark for a scenario where a retailer wants to analyze five years of transaction history to identify seasonal purchasing patterns. What's the key tradeoff?
Your company's data schema changes frequently as product features evolve. Which storage technology accommodates this flexibility, and what SQL guarantee does it typically sacrifice?
An FRQ describes a company that needs to: (a) ingest clickstream data in real-time, (b) store it for historical analysis, and (c) enable business analysts to query it with SQL. Map each requirement to the most appropriate technology.
Why might an organization use both Elasticsearch and Tableau in their data stack? What distinct business needs does each address?