Why This Matters
Data analytics sits at the heart of modern information systems—it's how organizations transform raw data into actionable intelligence. You're being tested on more than just knowing what these techniques are; you need to understand when to apply each method, what type of insight it produces, and how techniques build on each other in a complete analytics workflow. Exams often ask you to recommend the right technique for a given business scenario or explain why one approach works better than another.
These techniques span a logical progression: from preparing data to understanding patterns to making predictions to scaling for complexity. Think of them as tools in a toolkit—each designed for specific problems. Don't just memorize definitions—know what question each technique answers and what type of data it requires. When you see an FRQ scenario, your job is matching the business problem to the right analytical approach.
Preparing and Understanding Your Data
Before any analysis can happen, data must be collected, cleaned, and explored. These foundational techniques ensure you're working with reliable information and understand its basic characteristics. Garbage in, garbage out—these steps prevent flawed conclusions downstream.
Data Collection and Preprocessing
- Raw data gathering from databases, APIs, surveys, and sensors—must be relevant, sufficient, and ethically sourced for the analysis goals
- Data cleaning removes inaccuracies, duplicates, and missing values that would otherwise skew results and produce misleading insights
- Transformation techniques like normalization and encoding convert data into formats algorithms can process effectively (see the sketch below)
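A minimal preprocessing sketch in Python with pandas; the column names and values are hypothetical:

```python
import pandas as pd

# Hypothetical raw data with a missing value
df = pd.DataFrame({
    "age": [25, None, 47, 31],
    "plan": ["basic", "pro", "basic", "pro"],
})

# Cleaning: drop exact duplicates, fill missing ages with the median
df = df.drop_duplicates()
df["age"] = df["age"].fillna(df["age"].median())

# Transformation: min-max normalization and one-hot encoding
df["age_scaled"] = (df["age"] - df["age"].min()) / (df["age"].max() - df["age"].min())
df = pd.get_dummies(df, columns=["plan"])
print(df)
```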
Data Quality Assessment
- Accuracy, completeness, consistency, and reliability—the four dimensions that determine whether data can be trusted for decision-making
- Data profiling identifies quality issues like missing fields, outliers, and format inconsistencies before they corrupt analysis (see the sketch after this list)
- Ongoing monitoring ensures data integrity is maintained as new information flows into the system
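A minimal profiling sketch with pandas; the revenue figures are invented, and the IQR fence shown is one common outlier convention, not the only one:

```python
import pandas as pd

df = pd.DataFrame({"revenue": [120, 95, None, 4000, 110]})

# Completeness check: count missing fields per column
print(df.isna().sum())

# Outlier check via the IQR rule: flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)])
```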
Descriptive Statistics
- Central tendency measures (mean, median, mode) summarize typical values in your dataset
- Variability measures like standard deviation and range reveal how spread out data points are, which is critical for understanding risk and consistency (see the sketch below)
- Foundation for all other analysis—you can't build models without first understanding your data's basic shape
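A quick sketch using Python's built-in statistics module on made-up values:

```python
from statistics import mean, median, mode, stdev

data = [12, 15, 12, 18, 20, 12, 25]

print("mean:", mean(data))        # central tendency
print("median:", median(data))
print("mode:", mode(data))
print("std dev:", stdev(data))    # variability (sample standard deviation)
print("range:", max(data) - min(data))
```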
Compare: Data Quality Assessment vs. Data Preprocessing—both improve data reliability, but quality assessment diagnoses problems while preprocessing fixes them. FRQs may ask you to distinguish evaluation from action.
Exploring and Visualizing Patterns
Once data is prepared, these techniques help you see what's actually there. Exploration precedes explanation—you need to understand patterns before you can model or predict them.
Data Visualization
- Graphical representations (bar charts, scatter plots, heat maps) make patterns visible that raw numbers hide
- Trend and outlier detection becomes intuitive when data is displayed visually rather than in tables (see the sketch after this list)
- Tools like Tableau, Power BI, and Matplotlib are industry standards—know their strengths for different visualization needs
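A minimal Matplotlib sketch; the numbers are invented so the final point stands out as an outlier you could easily miss in a table:

```python
import matplotlib.pyplot as plt

weeks = [1, 2, 3, 4, 5, 6]
units = [10, 12, 11, 13, 12, 40]  # last value is a deliberate outlier

plt.scatter(weeks, units)
plt.title("Weekly units sold")
plt.xlabel("week")
plt.ylabel("units sold")
plt.show()
```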
Exploratory Data Analysis (EDA)
- Iterative investigation of datasets using both statistical summaries and visual methods to uncover structure
- Assumption testing reveals whether your data meets requirements for specific analytical techniques
- Hypothesis generation: EDA doesn't prove things, but it tells you what's worth investigating further (see the sketch below)
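A tiny EDA sketch with pandas on hypothetical ad-spend data; the summary and correlation don't prove anything, but they suggest a relationship worth testing formally:

```python
import pandas as pd

df = pd.DataFrame({
    "ad_spend": [10, 20, 30, 40, 50],
    "sales":    [15, 28, 33, 48, 52],
})

print(df.describe())  # statistical summary of each column
print(df.corr())      # a strong correlation here generates a hypothesis, not a conclusion
```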
Compare: Data Visualization vs. EDA—visualization is a tool, while EDA is a process that uses visualization alongside statistics. Think of visualization as one instrument in the EDA orchestra.
Finding Relationships and Making Predictions
These techniques move beyond description to explanation and forecasting. They answer "why" and "what next" questions—the insights that drive strategic decisions.
Regression Analysis
- Models relationships between a dependent variable and one or more independent variables by fitting an equation such as y = b0 + b1x (see the sketch below)
- Prediction and explanation—tells you both what will happen and which factors matter most
- Types include linear, logistic, and polynomial—choose based on whether your outcome is continuous or categorical
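A minimal ordinary-least-squares sketch with NumPy on synthetic points; polyfit with degree 1 fits y = b0 + b1x:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)  # independent variable
y = np.array([2.1, 4.3, 5.9, 8.2, 9.8])     # dependent variable

b1, b0 = np.polyfit(x, y, deg=1)  # returns slope first, then intercept
print(f"y = {b0:.2f} + {b1:.2f}x")
print("prediction at x = 6:", b0 + b1 * 6)
```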
Hypothesis Testing
- Statistical validation determines whether observed patterns are real or just random chance
- Null and alternative hypotheses frame the question; p-values and significance levels (typically α = 0.05) determine the answer (see the sketch below)
- Decision support—provides statistical confidence for business choices rather than gut feelings
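A sketch of a two-sample t-test with SciPy; the conversion-rate samples are invented:

```python
from scipy import stats

# Hypothetical conversion rates for two page designs
group_a = [0.12, 0.15, 0.11, 0.14, 0.13, 0.16]
group_b = [0.18, 0.21, 0.17, 0.20, 0.19, 0.22]

# H0: both designs convert equally well
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")  # alpha = 0.05
```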
Predictive Modeling
- Future outcome estimation using historical data patterns—the bridge between analytics and strategy
- Model generalization matters more than training accuracy; a model that only works on past data is useless (see the sketch below)
- Applications span industries: credit risk scoring, customer churn prediction, demand forecasting
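A sketch of the train/test discipline with scikit-learn on synthetic data; comparing the two scores is the generalization check described above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + rng.normal(0, 1, size=100)  # synthetic linear data plus noise

# Hold out data the model never sees during fitting
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = LinearRegression().fit(X_train, y_train)
print("train R^2:", model.score(X_train, y_train))
print("test  R^2:", model.score(X_test, y_test))  # similar scores suggest the model generalizes
```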
Compare: Regression Analysis vs. Predictive Modeling—regression is one technique within predictive modeling's broader toolkit. Regression explains relationships; predictive modeling optimizes for accurate forecasts using whatever methods work best.
Grouping and Categorizing Data
Classification and clustering both organize data into groups, but they work fundamentally differently. Classification uses labels you provide; clustering discovers labels on its own.
Classification Techniques
- Supervised learning assigns data to predefined categories based on labeled training examples
- Algorithms include decision trees, support vector machines, and neural networks—each with different strengths for different data types
- Real-world applications: spam filtering, medical diagnosis, fraud detection; anywhere you need to sort items into known buckets (see the sketch below)
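A minimal decision-tree sketch with scikit-learn; the spam features and labels are hypothetical:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical labeled examples: [message_length, link_count] -> 1 = spam, 0 = not spam
X = [[120, 0], [35, 4], [200, 1], [15, 6], [90, 0], [22, 5]]
y = [0, 1, 0, 1, 0, 1]

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
print(clf.predict([[30, 5]]))  # short message with many links: likely classified as spam
```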
Clustering Algorithms
- Unsupervised learning groups similar data points without predefined labels—the algorithm finds natural groupings
- K-means, hierarchical clustering, and DBSCAN are common methods with different assumptions about cluster shapes (see the sketch below)
- Discovery-oriented: market segmentation, customer profiling, anomaly detection in network traffic
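A minimal k-means sketch with scikit-learn; the customer features are invented, and note that no labels are supplied:

```python
from sklearn.cluster import KMeans

# Hypothetical customer features: [annual_spend, visits_per_month]
X = [[100, 1], [120, 2], [110, 1], [900, 10], [950, 12], [880, 9]]

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)           # cluster assignment the algorithm discovered per customer
print(km.cluster_centers_)  # a low-spend and a high-spend segment should emerge
```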
Compare: Classification vs. Clustering—classification asks "which group does this belong to?" while clustering asks "what groups exist?" If an FRQ describes labeled training data, think classification; if it mentions discovering unknown patterns, think clustering.
Handling Specialized Data Types
Some data requires specialized techniques. Time-ordered data and unstructured text each demand approaches designed for their unique characteristics.
Time Series Analysis
- Temporal pattern recognition identifies trends, seasonality, and cycles in data collected over time
- Techniques like exponential smoothing and ARIMA exploit the fact that observations close in time are related, typically weighting recent values more heavily (see the sketch below)
- Forecasting applications: sales projections, stock prices, resource demand planning
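A minimal simple-exponential-smoothing sketch in plain Python; the alpha value and sales figures are arbitrary:

```python
def exp_smooth(series, alpha=0.5):
    # Each smoothed value blends the newest observation with the running forecast;
    # a higher alpha weights recent observations more heavily.
    forecast = [series[0]]
    for obs in series[1:]:
        forecast.append(alpha * obs + (1 - alpha) * forecast[-1])
    return forecast

monthly_sales = [100, 110, 105, 130, 125, 150]
smoothed = exp_smooth(monthly_sales, alpha=0.6)
print("next-period forecast:", round(smoothed[-1], 1))
```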
Text Analytics
- Unstructured data processing extracts meaning from documents, social media, and customer feedback
- NLP techniques include sentiment analysis, topic modeling, and named entity recognition (a bare-bones sketch follows this list)
- Business value: understanding customer opinions at scale, automating document classification, competitive intelligence
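A bare-bones word-frequency sketch using only the standard library, a crude stand-in for real topic modeling; the reviews and stop-word list are invented:

```python
import re
from collections import Counter

reviews = [
    "Great battery life, love this phone",
    "Battery drains fast, screen is great",
    "Love the screen, battery could be better",
]

stop = {"the", "is", "this", "be", "could"}  # tiny hypothetical stop-word list
words = [w for r in reviews
         for w in re.findall(r"[a-z]+", r.lower())
         if w not in stop]
print(Counter(words).most_common(3))  # most-discussed terms across all reviews
```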
Compare: Time Series Analysis vs. Text Analytics—both handle specialized data types, but time series works with structured numerical sequences while text analytics processes unstructured language. Know which technique matches which data format.
Scaling Up: Advanced Analytics Approaches
These techniques handle complexity—whether that's massive data volumes, pattern discovery across variables, or systems that improve themselves over time.
Data Mining
- Pattern discovery in large datasets using statistical and machine learning methods combined
- Techniques include association rules (which items are purchased together), anomaly detection, and clustering (see the sketch below)
- Knowledge extraction—turns data warehouses into strategic insights
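A minimal pair-co-occurrence count in plain Python, the raw input to association rules; the baskets are hypothetical:

```python
from collections import Counter
from itertools import combinations

baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
]

# Count every item pair bought together; association-rule mining builds on these counts
pair_counts = Counter()
for basket in baskets:
    pair_counts.update(combinations(sorted(basket), 2))
print(pair_counts.most_common(2))
```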
Machine Learning Basics
- Systems that improve from experience without explicit programming for every scenario
- Three paradigms: supervised (labeled data), unsupervised (pattern discovery), reinforcement (learning from feedback)
- Foundation for modern AI applications: recommendation engines, image recognition, natural language processing
Big Data Analytics
- Massive scale processing of datasets too large for traditional database tools
- Technologies like Hadoop, Spark, and NoSQL databases enable distributed computing across clusters
- Competitive advantage: organizations that can analyze data at this scale can act on patterns that competitors without the capability miss
Compare: Data Mining vs. Machine Learning—data mining focuses on discovering patterns in existing data, while machine learning focuses on building models that generalize to new data. Data mining asks "what's in here?"; machine learning asks "what can I predict?"
Quick Reference Table
| Analytics Goal | Techniques |
| --- | --- |
| Data Preparation | Data Collection/Preprocessing, Data Quality Assessment |
| Understanding Data Shape | Descriptive Statistics, Data Visualization, EDA |
| Relationship Modeling | Regression Analysis, Hypothesis Testing |
| Future Prediction | Predictive Modeling, Time Series Analysis |
| Supervised Categorization | Classification Techniques |
| Unsupervised Grouping | Clustering Algorithms, Data Mining |
| Specialized Data Handling | Text Analytics, Time Series Analysis |
| Scale and Automation | Big Data Analytics, Machine Learning |
Self-Check Questions
- A retail company wants to identify which customers are most likely to stop purchasing. Which technique would you recommend, and why is it better suited than clustering for this problem?
- Compare and contrast classification and clustering: What type of learning does each represent, and what kind of business question does each answer?
- An analyst notices their predictive model performs perfectly on training data but poorly on new data. What concept does this illustrate, and which technique category would help diagnose the underlying data issues?
- A marketing team wants to understand what topics customers discuss most in product reviews. Which technique applies here, and how does it differ from traditional descriptive statistics?
- If an FRQ presents a scenario with sales data collected monthly over five years and asks for a forecast, which technique category is most appropriate? What makes time-dependent data require specialized methods?