Data analytics sits at the heart of modern information systems. It's how organizations transform raw data into actionable intelligence. You're being tested on more than just knowing what these techniques are; you need to understand when to apply each method, what type of insight it produces, and how techniques build on each other in a complete analytics workflow. Exams often ask you to recommend the right technique for a given business scenario or explain why one approach works better than another.
These techniques span a logical progression: from preparing data to understanding patterns to making predictions to scaling for complexity. Think of them as tools in a toolkit, each designed for specific problems. Don't just memorize definitions. Know what question each technique answers and what type of data it requires. When you see an FRQ scenario, your job is matching the business problem to the right analytical approach.
Before any analysis can happen, data must be collected, cleaned, and explored. These foundational techniques ensure you're working with reliable information and that you understand its basic characteristics. Garbage in, garbage out applies here: these steps prevent flawed conclusions downstream.
Data collection is the process of gathering raw data from sources like databases, APIs, surveys, and sensors. The data you collect must be relevant to your analysis goals, sufficient in volume, and ethically sourced.
Once collected, data cleaning removes inaccuracies, duplicates, and missing values that would otherwise skew results. For example, if a customer database has 15% of its email fields blank and several duplicate entries, those issues need to be resolved before any meaningful analysis.
Transformation techniques then convert data into formats that algorithms can actually process. Normalization rescales numerical values to a common range (like 0 to 1), and encoding converts categorical data (like "Yes"/"No") into numerical representations.
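The cleaning and transformation steps above can be sketched in a few lines of plain Python. The field names, values, and customer records below are invented for illustration:

```python
# Sketch of basic cleaning and transformation on a toy customer list.
records = [
    {"email": "a@example.com", "spend": 120.0, "subscribed": "Yes"},
    {"email": "",              "spend": 80.0,  "subscribed": "No"},   # missing email
    {"email": "a@example.com", "spend": 120.0, "subscribed": "Yes"},  # duplicate
    {"email": "b@example.com", "spend": 60.0,  "subscribed": "No"},
]

# Cleaning: drop rows with blank emails, then deduplicate.
cleaned = [r for r in records if r["email"]]
seen, deduped = set(), []
for r in cleaned:
    key = (r["email"], r["spend"])
    if key not in seen:
        seen.add(key)
        deduped.append(r)

# Transformation: min-max normalization of spend to [0, 1],
# and encoding of the Yes/No field as 1/0.
spends = [r["spend"] for r in deduped]
lo, hi = min(spends), max(spends)
for r in deduped:
    r["spend_norm"] = (r["spend"] - lo) / (hi - lo) if hi > lo else 0.0
    r["subscribed_enc"] = 1 if r["subscribed"] == "Yes" else 0

print(deduped)
```

In practice a library like pandas would handle this, but the logic is the same: filter, deduplicate, rescale, encode.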
Data quality is evaluated across four key dimensions: accuracy (is the data correct?), completeness (are values missing?), consistency (do related records agree?), and reliability (can the source be trusted?). Together, these determine whether data can support sound decision-making.
Data profiling is the diagnostic step. It scans datasets to identify quality issues like missing fields, outliers, and format inconsistencies before they corrupt your analysis. Once data is in production, ongoing monitoring ensures integrity is maintained as new information flows into the system.
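A profiling pass like the one described can be mimicked with simple checks. The column names, plausibility threshold, and date-format rule here are all illustrative assumptions:

```python
import re

# Toy profiling pass: count missing values, flag implausible values,
# and check date formats against an expected ISO pattern.
rows = [
    {"age": 34,   "signup": "2023-01-05"},
    {"age": None, "signup": "2023-02-11"},
    {"age": 36,   "signup": "02/11/2023"},  # inconsistent format
    {"age": 240,  "signup": "2023-03-01"},  # likely data-entry error
]

missing = sum(1 for r in rows if r["age"] is None)

ages = [r["age"] for r in rows if r["age"] is not None]
outliers = [a for a in ages if not (0 <= a <= 120)]  # plausibility range

iso = re.compile(r"^\d{4}-\d{2}-\d{2}$")
bad_format = [r["signup"] for r in rows if not iso.match(r["signup"])]

print(missing, outliers, bad_format)
```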
Descriptive statistics give you a snapshot of your dataset's basic shape. Central tendency measures (mean, median, mode) summarize typical values. If the average order value in an e-commerce dataset is noticeably higher than the median, that gap tells you a few large orders are pulling the mean upward.
Variability measures like the range and standard deviation reveal how spread out data points are. This matters for understanding risk and consistency. A supplier with an average delivery time of 3 days and a standard deviation of 0.5 days is far more reliable than one averaging 3 days with a standard deviation of 2.5 days.
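The supplier comparison can be reproduced with Python's `statistics` module. The delivery times below are invented, but they show the same pattern: identical means, very different spread:

```python
import statistics

# Two suppliers with the same average delivery time (3 days)
# but very different consistency.
supplier_a = [3.0, 2.5, 3.5, 3.0, 3.0]
supplier_b = [1.0, 5.5, 0.5, 5.0, 3.0]

for name, times in [("A", supplier_a), ("B", supplier_b)]:
    print(name,
          statistics.mean(times),    # central tendency
          statistics.median(times),
          round(statistics.stdev(times), 2))  # variability
```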
These statistics are the foundation for everything else. You can't build models without first understanding what your data looks like.
Compare: Data Quality Assessment vs. Data Preprocessing: both improve data reliability, but quality assessment diagnoses problems while preprocessing fixes them. FRQs may ask you to distinguish evaluation from action.
Once data is prepared, these techniques help you see what's actually there. Exploration precedes explanation: you need to understand patterns before you can model or predict them.
Graphical representations like bar charts, scatter plots, and heat maps make patterns visible that raw numbers hide. A table of 10,000 sales records is hard to interpret, but a scatter plot of those same records can instantly reveal a correlation between advertising spend and revenue.
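The relationship a scatter plot reveals can also be quantified. A minimal sketch of the Pearson correlation coefficient, using made-up spend and revenue figures:

```python
import statistics

# Hypothetical monthly advertising spend and revenue.
spend   = [10, 20, 30, 40, 50]
revenue = [110, 190, 320, 390, 480]

# Pearson correlation coefficient from the definition:
# sample covariance divided by the product of standard deviations.
n = len(spend)
mx, my = statistics.mean(spend), statistics.mean(revenue)
cov = sum((x - mx) * (y - my) for x, y in zip(spend, revenue)) / (n - 1)
r = cov / (statistics.stdev(spend) * statistics.stdev(revenue))

print(round(r, 3))  # close to 1.0 indicates a strong positive relationship
```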
Visualization also makes trend and outlier detection intuitive. A single data point far from the cluster in a scatter plot jumps out visually in a way it never would buried in a spreadsheet.
Common industry tools include Tableau (strong for interactive dashboards), Power BI (integrates well with Microsoft ecosystems), and Matplotlib (a Python library for custom, code-driven charts). Know that these exist and what they're generally used for.
EDA is an iterative investigation process that combines statistical summaries and visual methods to uncover structure in your data. Where visualization is a single tool, EDA is the broader workflow that uses visualization alongside descriptive statistics, correlation checks, and distribution analysis.
A key part of EDA is assumption testing: checking whether your data meets the requirements for specific analytical techniques. For instance, linear regression assumes a roughly linear relationship between variables. EDA helps you verify that before you commit to a method.
EDA is also where hypothesis generation happens. It doesn't prove anything on its own, but it tells you what's worth investigating further with more rigorous techniques.
Compare: Data Visualization vs. EDA: visualization is a tool, while EDA is a process that uses visualization alongside statistics. Think of visualization as one instrument in the EDA orchestra.
These techniques move beyond description to explanation and forecasting. They answer "why" and "what next" questions, which are the insights that drive strategic decisions.
Regression models the relationship between a dependent variable (the outcome you care about) and one or more independent variables (the factors that might influence it). The simplest form is linear regression, expressed as y = mx + b, where m is the slope and b is the y-intercept.
Regression serves two purposes: prediction (estimating future values) and explanation (identifying which factors matter most). If you regress monthly sales against advertising spend, temperature, and day of the week, the model tells you both what sales to expect and which variable has the strongest influence.
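A least-squares line can be fitted in a few lines of plain Python. The data below relating advertising spend (x) to sales (y) is invented for illustration:

```python
# Minimal least-squares fit of y = m*x + b.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Slope: covariance of x and y divided by variance of x.
m = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
    sum((x - mean_x) ** 2 for x in xs)
b = mean_y - m * mean_x

print(round(m, 2), round(b, 2))  # fitted slope and intercept
print(round(m * 6 + b, 2))       # prediction for a new x value
```

This shows both purposes at once: the slope explains how strongly x drives y, and plugging in a new x value produces a prediction.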
Types to know:
- Simple linear regression: one independent variable
- Multiple linear regression: several independent variables
- Logistic regression: predicts a binary outcome (e.g., churn vs. no churn)
Hypothesis testing provides statistical validation for whether observed patterns are real or just random noise. Without it, you might act on a trend that's actually meaningless.
The process works like this:
1. State a null hypothesis (no effect) and an alternative hypothesis (the effect you suspect).
2. Choose a significance level (commonly 0.05).
3. Compute a test statistic from the sample data.
4. Compare the resulting p-value to the significance level: if it's smaller, reject the null hypothesis.
This technique provides statistical confidence for business decisions rather than relying on intuition alone. For example, before rolling out a new website design company-wide, you'd use hypothesis testing to confirm that the design actually improved conversion rates in an A/B test.
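The A/B-test scenario can be sketched as a two-proportion z-test. The visitor and conversion counts are invented, and the 1.96 cutoff assumes a two-sided test at the 0.05 significance level:

```python
import math

# Hypothetical A/B test: did the new design improve conversions?
conv_a, n_a = 200, 4000  # old design: 5.0% conversion
conv_b, n_b = 260, 4000  # new design: 6.5% conversion

p_a, p_b = conv_a / n_a, conv_b / n_b

# Pooled proportion under the null hypothesis (no difference).
p_pool = (conv_a + conv_b) / (n_a + n_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
z = (p_b - p_a) / se

print(round(z, 2))  # a |z| above 1.96 rejects the null at alpha = 0.05
```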
Predictive modeling uses historical data patterns to estimate future outcomes. It's the bridge between analytics and strategy.
The most important concept here is generalization: a model's ability to perform well on new, unseen data, not just the data it was trained on. A model that memorizes its training data but fails on new inputs is overfitting, and it's essentially useless in practice.
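A toy contrast makes the overfitting idea concrete. Both "models" below are illustrative stand-ins, not real learning algorithms:

```python
# Data roughly follows the rule y = 2x, with a little noise.
train = {1: 2.1, 2: 3.9, 3: 6.2}

def memorizer(x):
    # "Overfit" model: perfect on training inputs, useless elsewhere.
    return train.get(x)

def linear_model(x):
    # Simple rule capturing the data's overall trend.
    return 2.0 * x

print(memorizer(2), linear_model(2))    # both look fine on seen data
print(memorizer(10), linear_model(10))  # the memorizer fails on unseen input
```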
Applications span industries: credit risk scoring in banking, customer churn prediction in telecom, demand forecasting in retail, and patient readmission prediction in healthcare.
Compare: Regression Analysis vs. Predictive Modeling: regression is one technique within predictive modeling's broader toolkit. Regression explains relationships; predictive modeling optimizes for accurate forecasts using whatever methods work best (regression, decision trees, neural networks, etc.).
Classification and clustering both organize data into groups, but they work in fundamentally different ways. Classification uses labels you provide; clustering discovers labels on its own.
Classification is a form of supervised learning, meaning the algorithm learns from labeled training examples. You show it data where the correct category is already known, and it learns rules for assigning new data to those same categories.
Common algorithms include decision trees (easy to interpret, good for rule-based decisions), support vector machines (effective for high-dimensional data), and neural networks (powerful for complex patterns but harder to interpret).
Real-world applications include spam filtering (spam vs. not spam), medical diagnosis (disease vs. no disease), and fraud detection (fraudulent vs. legitimate transaction). The common thread: you're sorting items into known, predefined buckets.
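A minimal supervised classifier, assuming labeled training points: a nearest-neighbor rule that assigns each new item the label of its closest known example. The fraud-detection coordinates are invented:

```python
# Labeled training data: (feature point, known class).
train = [
    ((1.0, 1.0), "legitimate"),
    ((1.2, 0.9), "legitimate"),
    ((8.0, 9.0), "fraudulent"),
    ((7.5, 8.5), "fraudulent"),
]

def classify(point):
    # Assign the label of the nearest training example (1-NN).
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    _, label = min(((sq_dist(point, p), lbl) for p, lbl in train),
                   key=lambda t: t[0])
    return label

print(classify((1.1, 1.1)))  # near the legitimate cluster
print(classify((8.2, 8.8)))  # near the fraudulent cluster
```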
Clustering is a form of unsupervised learning. There are no predefined labels. Instead, the algorithm groups similar data points together and lets you interpret what those groups mean.
Clustering is discovery-oriented: market segmentation, customer profiling, and anomaly detection in network traffic are all cases where you don't know the groups ahead of time.
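By contrast, a clustering sketch starts with no labels at all. Here is a miniature k-means (k = 2) on one-dimensional spending data; the figures and starting centroids are arbitrary assumptions:

```python
# Unlabeled customer spending data with two natural groups.
points = [1.0, 1.5, 2.0, 9.0, 10.0, 11.0]
centroids = [1.0, 9.0]  # initial guesses

for _ in range(10):
    # Assignment step: each point joins its nearest centroid's cluster.
    clusters = {0: [], 1: []}
    for p in points:
        idx = min((0, 1), key=lambda i: abs(p - centroids[i]))
        clusters[idx].append(p)
    # Update step: move each centroid to its cluster's mean.
    centroids = [sum(c) / len(c) for c in clusters.values()]

print(centroids, clusters)  # the algorithm discovered the two groups itself
```

Note that the algorithm never saw labels; interpreting the groups ("budget" vs. "premium" customers, say) is left to the analyst.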
Compare: Classification vs. Clustering: classification asks "which group does this belong to?" while clustering asks "what groups exist?" If an FRQ describes labeled training data, think classification. If it mentions discovering unknown patterns, think clustering.
Some data requires specialized techniques. Time-ordered data and unstructured text each demand approaches designed for their unique characteristics.
Time series analysis identifies temporal patterns in data collected at regular intervals over time: trends (long-term direction), seasonality (repeating cycles), and irregular fluctuations.
Techniques like ARIMA (AutoRegressive Integrated Moving Average) and exponential smoothing are built around a key principle: recent observations typically matter more than distant ones when forecasting. Standard regression doesn't account for this time-dependent structure, which is why time series data needs its own methods.
Forecasting applications include sales projections, stock price estimation, and resource demand planning. Any time the order and timing of data points carry meaning, time series analysis is the right fit.
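Simple exponential smoothing shows the "recent observations matter more" principle directly: each forecast blends the newest observation with the previous forecast. The alpha value and sales series are illustrative:

```python
# Simple exponential smoothing: forecast = alpha*obs + (1-alpha)*forecast.
# Higher alpha weights recent observations more heavily.
sales = [100, 104, 101, 110, 108, 115]
alpha = 0.5

forecast = sales[0]
for obs in sales[1:]:
    forecast = alpha * obs + (1 - alpha) * forecast

print(round(forecast, 2))  # next-period forecast
```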
Text analytics extracts structured meaning from unstructured data like documents, social media posts, and customer feedback. This is a huge deal because the majority of organizational data is unstructured.
Key NLP (Natural Language Processing) techniques include:
- Sentiment analysis: gauging whether text expresses positive or negative opinion
- Topic modeling: discovering recurring themes across documents
- Named entity recognition: identifying people, organizations, and places in text
- Text classification: sorting documents into predefined categories
The business value is clear: understanding customer opinions at scale (instead of reading thousands of reviews manually), automating document classification, and gathering competitive intelligence from public sources.
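A naive sketch of turning unstructured reviews into structure: tokenize, count topic words, and score sentiment against tiny hand-made word lists. The reviews and word lists are invented, and real NLP systems are far more sophisticated:

```python
from collections import Counter

reviews = [
    "Great battery life and great screen",
    "Terrible battery, poor support",
    "Screen is great, battery is fine",
]

# Toy sentiment lexicons (illustrative only).
positive = {"great", "fine", "good"}
negative = {"terrible", "poor", "bad"}

# Tokenize: lowercase words with punctuation stripped.
tokens = [w.strip(",.").lower() for r in reviews for w in r.split()]

# Frequency counts surface the topics customers mention most.
top_words = Counter(tokens).most_common(3)

# Net sentiment: +1 per positive word, -1 per negative word.
sentiment = sum((w in positive) - (w in negative) for w in tokens)

print(top_words, sentiment)
```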
Compare: Time Series Analysis vs. Text Analytics: both handle specialized data types, but time series works with structured numerical sequences while text analytics processes unstructured language. Know which technique matches which data format.
These techniques handle complexity, whether that's massive data volumes, pattern discovery across many variables, or systems that improve themselves over time.
Data mining is about pattern discovery in large datasets, combining statistical methods and machine learning techniques to extract knowledge that isn't obvious on the surface.
Key techniques within data mining include:
- Association rule mining: finding items that frequently occur together (e.g., market basket analysis)
- Classification and clustering: applied at scale to large datasets
- Anomaly detection: flagging records that deviate sharply from the norm
The goal is knowledge extraction, turning massive data warehouses into strategic insights that inform business decisions.
Machine learning refers to systems that improve from experience without being explicitly programmed for every scenario. Instead of writing rules by hand, you feed the system data and let it learn the rules.
Three paradigms to know:
- Supervised learning: learns from labeled examples (classification, regression)
- Unsupervised learning: finds structure in unlabeled data (clustering)
- Reinforcement learning: learns through trial and error, guided by rewards and penalties
Machine learning is the foundation for modern AI applications like recommendation engines (Netflix, Spotify), image recognition, and natural language processing.
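The "improving from experience" idea can be shown in miniature with gradient descent: a single weight is nudged repeatedly until predictions fit the data. The data, learning rate, and iteration count are illustrative assumptions:

```python
# Data generated by the underlying rule y = 2x; the system must
# learn that rule from examples rather than being given it.
data = [(1, 2.0), (2, 4.0), (3, 6.0)]
w, lr = 0.0, 0.05  # initial weight and learning rate

for _ in range(200):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    w -= lr * grad  # step downhill; error shrinks with experience

print(round(w, 3))  # the learned weight approaches 2.0
```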
Big data analytics processes datasets that are too large, too fast-moving, or too varied for traditional database tools to handle. The classic framework describes big data with the "3 Vs": Volume (massive size), Velocity (rapid generation), and Variety (multiple formats).
Technologies like Hadoop (distributed storage and batch processing), Spark (faster, in-memory processing), and NoSQL databases (flexible schemas for unstructured data) enable distributed computing across clusters of machines.
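The MapReduce pattern behind Hadoop-style processing can be sketched conceptually in a single process. On a real cluster, each phase would run in parallel across many machines; the documents here are invented:

```python
from collections import defaultdict

# Conceptual MapReduce word count in miniature.
documents = ["big data big ideas", "fast data fast answers"]

# Map phase: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate each group.
counts = {word: sum(vals) for word, vals in grouped.items()}
print(counts)
```

The point of the pattern is that map and reduce operate independently per record and per key, which is what lets the work be distributed.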
Organizations that can effectively harness big data gain a competitive advantage through faster, more granular, and more comprehensive analysis.
Compare: Data Mining vs. Machine Learning: data mining focuses on discovering patterns in existing data, while machine learning focuses on building models that generalize to new data. Data mining asks "what's in here?"; machine learning asks "what can I predict?"
| Concept | Best Examples |
|---|---|
| Data Preparation | Data Collection/Preprocessing, Data Quality Assessment |
| Understanding Data Shape | Descriptive Statistics, Data Visualization, EDA |
| Relationship Modeling | Regression Analysis, Hypothesis Testing |
| Future Prediction | Predictive Modeling, Time Series Analysis |
| Supervised Categorization | Classification Techniques |
| Unsupervised Grouping | Clustering Algorithms, Data Mining |
| Specialized Data Handling | Text Analytics, Time Series Analysis |
| Scale and Automation | Big Data Analytics, Machine Learning |
A retail company wants to identify which customers are most likely to stop purchasing. Which technique would you recommend, and why is it better suited than clustering for this problem?
Compare and contrast classification and clustering: What type of learning does each represent, and what kind of business question does each answer?
An analyst notices their predictive model performs perfectly on training data but poorly on new data. What concept does this illustrate, and which technique category would help diagnose the underlying data issues?
A marketing team wants to understand what topics customers discuss most in product reviews. Which technique applies here, and how does it differ from traditional descriptive statistics?
If an FRQ presents a scenario with sales data collected monthly over five years and asks for a forecast, which technique category is most appropriate? What makes time-dependent data require specialized methods?