
🪓Data Journalism

Crucial Data Analysis Methods


Why This Matters

Data journalism lives or dies on the strength of your analysis. You're not just collecting numbers—you're being tested on your ability to clean messy datasets, find meaningful patterns, and communicate insights clearly. Every method in this guide serves a specific purpose in the data journalism workflow, from ensuring your data is trustworthy to revealing the story hidden within it.

The methods here break down into distinct categories: preparing your data, understanding what you have, finding relationships and patterns, and communicating your findings responsibly. Don't just memorize technique names—know when to use each method and what question it answers. A regression analysis won't help you if your data hasn't been cleaned first, and a beautiful visualization means nothing if you've violated ethical guidelines to obtain your data.


Data Preparation and Collection

Before any analysis can begin, you need reliable data in a usable format. Garbage in, garbage out—the quality of your journalism depends entirely on the integrity of your source material.

Data Cleaning and Preprocessing

  • Identifies and corrects errors—including typos, formatting inconsistencies, and impossible values that would skew your analysis
  • Handles missing values and duplicates through deletion, imputation, or flagging, depending on how the gaps affect your story
  • Standardizes formats across datasets, essential when merging sources from different organizations or time periods
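
The three cleaning steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a prescribed workflow: the records, the field names, and the choice to flag (rather than delete or impute) missing values are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical raw records: inconsistent city names, mixed date formats,
# a duplicate row, a missing value, and an impossible value.
raw_rows = [
    {"city": " Springfield ", "date": "2023-01-15", "population": "116250"},
    {"city": "springfield", "date": "01/15/2023", "population": "116250"},
    {"city": "Shelbyville", "date": "2023-02-01", "population": ""},
    {"city": "Ogdenville", "date": "2023-03-10", "population": "-40"},
]

# Date formats we assume appear in the sources being merged.
DATE_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def parse_date(text):
    """Try each known format; return an ISO date string, or None if none match."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def clean(rows):
    """Standardize formats, flag bad or missing values, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()          # standardize text
        date = parse_date(row["date"].strip())      # standardize dates
        pop_text = row["population"].strip()
        population = int(pop_text) if pop_text else None
        flags = []
        if population is None:
            flags.append("missing population")      # flag, do not guess
        elif population < 0:
            flags.append("impossible population")   # a count cannot be negative
            population = None
        key = (city, date)
        if key in seen:                             # duplicate after standardizing
            continue
        seen.add(key)
        cleaned.append(
            {"city": city, "date": date, "population": population, "flags": flags}
        )
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Note that the duplicate only becomes visible after the names and dates are standardized, which is why the order of the steps matters.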

Data Scraping and Web Mining

  • Extracts data from websites using web crawlers, APIs, or manual collection when no structured dataset exists
  • Enables original reporting by gathering information that hasn't been compiled before—think campaign finance records or public meeting minutes
  • Requires technical skills in tools like Python's BeautifulSoup or Scrapy, plus understanding of a site's terms of service
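
To show the basic idea of turning a web page into rows of data, here is a small sketch. It deliberately uses Python's built-in `html.parser` instead of BeautifulSoup or Scrapy so it runs without extra installs, and it parses a hypothetical HTML fragment held in a string rather than fetching a real site. The donor names and amounts are invented.

```python
from html.parser import HTMLParser

# Hypothetical page fragment standing in for a downloaded HTML document.
SAMPLE_HTML = """
<table>
  <tr><th>Donor</th><th>Amount</th></tr>
  <tr><td>A. Example</td><td>250</td></tr>
  <tr><td>B. Sample</td><td>1000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._current_row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._current_row = []
        elif tag in ("td", "th") and self._current_row is not None:
            self._in_cell = True
            self._current_row.append("")

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._in_cell:
            self._current_row[-1] += data

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._current_row is not None:
            self.rows.append([cell.strip() for cell in self._current_row])
            self._current_row = None


parser = TableParser()
parser.feed(SAMPLE_HTML)

# First row is the header; the rest become one dict per record.
header, *records = parser.rows
donations = [dict(zip(header, record)) for record in records]
print(donations)
```

Before pointing a scraper like this at a live site, check the site's terms of service, as noted above.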

Compare: Data cleaning vs. data scraping—cleaning fixes problems in data you already have, while scraping creates new datasets from online sources. Both are pre-analysis steps, but scraping raises additional legal and ethical questions about permission and attribution.


Understanding Your Data

Once your data is clean, you need to understand its basic shape and characteristics. Descriptive methods tell you what you have; exploratory methods help you figure out what questions to ask.

Descriptive Statistics

  • Summarizes datasets numerically using measures like mean, median, mode, variance, and standard deviation
  • Reveals central tendencies and spread—is your data clustered tightly or widely dispersed? Is the average skewed by outliers?
  • Provides baseline context that readers need before you can make meaningful comparisons or claims
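
A quick sketch of why the choice of summary measure matters, using Python's built-in `statistics` module. The income figures are hypothetical and were chosen so that one extreme value drags the mean far away from the median.

```python
import statistics

# Hypothetical household incomes; the last value is an extreme outlier.
incomes = [38_000, 42_000, 45_000, 52_000, 55_000, 61_000, 900_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
spread = statistics.stdev(incomes)  # sample standard deviation

print(f"mean:   {mean_income:,.0f}")
print(f"median: {median_income:,.0f}")
print(f"stdev:  {spread:,.0f}")
```

Here the mean is more than three times the median, so reporting "the average income" without checking for outliers would mislead readers about the typical household.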

Exploratory Data Analysis (EDA)

  • Uncovers patterns visually through histograms, box plots, scatter plots, and other diagnostic graphics
  • Generates hypotheses rather than testing them—EDA helps you notice that crime spikes on weekends before you investigate why
  • Identifies outliers and anomalies that might represent errors, interesting exceptions, or the heart of your story
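
EDA does not need a charting library to get started. The sketch below prints a rough text histogram of incidents by weekday, echoing the weekend crime example above. The incident dates are invented for illustration.

```python
from collections import Counter
from datetime import date

# Hypothetical incident dates in March 2024 (day of month listed per incident).
incident_dates = [
    date(2024, 3, day)
    for day in (1, 2, 2, 3, 3, 3, 4, 6, 9, 9, 10, 10, 12, 16, 16, 17)
]

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# date.weekday() returns 0 for Monday through 6 for Sunday.
counts = Counter(WEEKDAYS[d.weekday()] for d in incident_dates)

# One '#' per incident gives a quick visual sense of the distribution.
for day in WEEKDAYS:
    print(f"{day} {'#' * counts[day]}")

weekend_share = (counts["Sat"] + counts["Sun"]) / len(incident_dates)
print(f"weekend share: {weekend_share:.0%}")
```

A pattern like this is a hypothesis to investigate, not a finding. The next step would be asking why, and checking whether the pattern holds in more data.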

Compare: Descriptive statistics vs. EDA. Descriptive stats give you numbers (the median income is $52,000), while EDA shows you shapes and patterns (income distribution is bimodal with peaks at $30,000 and $80,000). Use both together for a complete picture.


Finding Relationships and Patterns

This is where data journalism gets powerful. These methods help you move from "what happened" to "why it happened" and "what might happen next."

Correlation Analysis

  • Measures relationship strength between two variables using coefficients like Pearson's r, which ranges from -1 to +1
  • Identifies potential connections worth investigating—a high correlation between poverty rates and health outcomes suggests a story
  • Does not prove causation. This distinction is critical for accurate reporting and will appear on exams
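
For reference, Pearson's r can be computed directly from its definition: the co-variation of the two variables divided by the product of their individual variation. The sketch below does that in plain Python; the poverty and life-expectancy figures are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient for two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    co_variation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    variation_y = sum((y - mean_y) ** 2 for y in ys)
    if variation_x == 0 or variation_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return co_variation / math.sqrt(variation_x * variation_y)


# Hypothetical county data: poverty rate (%) and life expectancy (years).
poverty_rate = [5, 10, 15, 20, 25]
life_expectancy = [82, 80, 79, 76, 74]

r = pearson_r(poverty_rate, life_expectancy)
print(f"r = {r:.2f}")
```

A value close to -1, as here, means the two variables move in opposite directions almost in lockstep. It still says nothing about whether one causes the other.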

Regression Analysis

  • Predicts outcomes by modeling how independent variables influence a dependent variable
  • Quantifies relationships so you can say "for every $1,000 increase in income, life expectancy rises by X years"
  • Includes multiple types: linear regression for continuous outcomes, logistic regression for yes/no outcomes, multiple regression for complex models
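
Here is a minimal sketch of simple linear regression (one independent variable) fitted by ordinary least squares. The data is hypothetical, with income measured in thousands of dollars so the slope reads as "years of life expectancy per $1,000 of income."

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: median income (thousands of dollars) and life expectancy.
income_thousands = [20, 40, 60, 80, 100]
life_expectancy = [74.0, 76.5, 78.0, 80.5, 82.0]

slope, intercept = fit_line(income_thousands, life_expectancy)
predicted_at_70k = intercept + slope * 70

print(f"each extra $1,000 of income: {slope:+.2f} years")
print(f"predicted life expectancy at $70,000: {predicted_at_70k:.1f}")
```

With this invented data, the fitted slope is about 0.1 years per $1,000. Like correlation, the fit describes an association in the data and does not by itself establish cause.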

Time Series Analysis

  • Analyzes data over intervals to identify trends, cycles, and seasonal patterns in everything from stock prices to crime rates
  • Enables forecasting using techniques like moving averages and ARIMA models to project future values
  • Contextualizes current events—is this month's unemployment rate unusual, or part of a predictable seasonal pattern?
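
A moving average is the simplest of the smoothing techniques named above. This sketch averages each run of `window` consecutive values; the monthly unemployment figures are made up for the example.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


# Hypothetical monthly unemployment rates (%).
monthly_unemployment = [5.0, 5.2, 5.1, 4.8, 4.6, 4.9, 5.3, 5.5]

smoothed = moving_average(monthly_unemployment, window=3)
print([round(value, 2) for value in smoothed])
```

The smoothed series is shorter than the original because the first full window only exists once three months of data are available. Smoothing makes a trend easier to see, but it does not separate seasonal patterns from the trend by itself.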

Compare: Correlation vs. regression—correlation tells you that two variables move together; regression tells you how much one variable changes when another changes. If an exam asks you to predict an outcome, regression is your tool.

Hypothesis Testing

  • Tests claims statistically by comparing observed data against a null hypothesis (the assumption that nothing interesting is happening)
  • Uses p-values to assess significance: a p-value below 0.05 is the conventional threshold, meaning data at least this extreme would be unlikely if the null hypothesis were true
  • Validates or challenges assumptions in public policy claims, scientific studies, and official statistics
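
One way to make the logic of a hypothesis test concrete is an exact permutation test, sketched below. This guide does not name a specific test, so treat the choice as an assumption. The null hypothesis is that the group labels do not matter, so the code tries every possible way of splitting the pooled values into two groups and counts how often the gap in means is at least as large as the one observed. The weekly incident counts are hypothetical.

```python
from itertools import combinations


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test on the difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    total = sum(pooled)
    observed = abs(sum(group_a) / n_a - sum(group_b) / n_b)

    at_least_as_extreme = 0
    n_splits = 0
    # Every way of choosing which pooled values get the "group A" label.
    for indices in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in indices)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            at_least_as_extreme += 1
        n_splits += 1
    return at_least_as_extreme / n_splits


# Hypothetical weekly incident counts before and after a policy change.
before = [12, 15, 14, 16, 13]
after = [9, 8, 11, 10, 7]

p_value = permutation_p_value(before, after)
print(f"p = {p_value:.4f}")
```

Enumerating every split only works for small samples (here there are 252 splits). Larger datasets usually rely on random resampling or a standard parametric test instead.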

Statistical Significance and P-Values

  • Assesses whether an observed result could plausibly have arisen from random variation in your sample alone
  • Requires careful interpretation—statistical significance doesn't equal practical importance or newsworthiness
  • Commonly misunderstood, making it a frequent source of errors in news reporting and a key concept to master

Compare: Hypothesis testing vs. p-values—hypothesis testing is the overall framework (formulate hypotheses, collect data, make decisions), while p-values are the specific metric used to make those decisions. Know both the process and the number.


Analyzing Unstructured and Specialized Data

Not all data comes in neat spreadsheets. These methods handle text, locations, and network relationships—increasingly important as journalism expands beyond traditional datasets.

Text Mining and Natural Language Processing (NLP)

  • Extracts insights from unstructured text like documents, emails, speeches, and social media posts
  • Automates analysis at scale using techniques like tokenization, stemming, and named entity recognition
  • Powers document-based investigations such as analyzing thousands of court records or leaked communications
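
Tokenization, the first technique named above, just means splitting text into words. This sketch tokenizes two invented sentences with a regular expression, drops a tiny hand-picked stopword list, and counts what remains. Real projects typically use a dedicated NLP library; everything here is a simplification for illustration.

```python
import re
from collections import Counter

# A tiny, hand-picked stopword list (an assumption; real lists are much longer).
STOPWORDS = {"the", "a", "of", "to", "and", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical documents standing in for a larger collection.
documents = [
    "The council approved the budget for the new stadium.",
    "Budget cuts hit the library; the council defended the stadium deal.",
]

term_counts = Counter(
    token
    for document in documents
    for token in tokenize(document)
    if token not in STOPWORDS
)

print(term_counts.most_common(3))
```

Even a simple count like this can point you toward the terms worth searching for across thousands of records.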

Sentiment Analysis

  • Determines emotional tone in text data, classifying content as positive, negative, or neutral
  • Monitors public opinion across social media, customer reviews, and comment sections
  • Combines NLP with machine learning to process volumes of text no human could read manually
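
To see the positive/negative/neutral classification in action, here is a deliberately simple word-list scorer. It is not the machine-learning approach described above. It only counts matches against two small, invented word lists, which is enough to show the input and output of the task and also hints at why real systems need more than keyword matching.

```python
import re

# Tiny illustrative word lists (assumptions, not a published lexicon).
POSITIVE_WORDS = {"good", "great", "love", "excellent", "helpful"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "awful", "broken"}


def classify_sentiment(text):
    """Label text by comparing counts of positive and negative words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(token in POSITIVE_WORDS for token in tokens)
    score -= sum(token in NEGATIVE_WORDS for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


# Hypothetical public comments.
comments = [
    "I love the new park, great work",
    "The bus service is terrible and broken",
    "The meeting is on Tuesday",
]

for comment in comments:
    print(f"{classify_sentiment(comment):>8}: {comment}")
```

A scorer like this cannot handle negation or sarcasm ("not good" would still score as positive), so treat any automated sentiment label as a lead to verify, not a fact to report.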

Network Analysis

  • Maps relationships between entities using graph theory concepts like nodes (people, organizations) and edges (connections)
  • Identifies influential actors and community structures within social networks, corporate boards, or criminal organizations
  • Reveals hidden patterns in who knows whom, who funds whom, and how information or money flows
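
The node-and-edge idea translates directly into code. This sketch builds a small undirected network from a list of connections and counts each node's degree (its number of distinct connections), which is one basic way to spot a well-connected actor. The donors, PACs, and candidate are invented.

```python
from collections import defaultdict

# Hypothetical "who funds whom" connections; each pair is one edge.
edges = [
    ("Donor A", "PAC 1"),
    ("Donor B", "PAC 1"),
    ("Donor C", "PAC 1"),
    ("Donor C", "PAC 2"),
    ("PAC 1", "Candidate X"),
    ("PAC 2", "Candidate X"),
]

# Build an undirected adjacency map: node -> set of neighboring nodes.
neighbors = defaultdict(set)
for source, target in edges:
    neighbors[source].add(target)
    neighbors[target].add(source)

# Degree = number of distinct connections a node has.
degree = {node: len(adjacent) for node, adjacent in neighbors.items()}
most_connected = max(degree, key=degree.get)

for node, count in sorted(degree.items(), key=lambda item: -item[1]):
    print(f"{node}: {count}")
print(f"most connected: {most_connected}")
```

Degree is only one measure of influence. An actor with few connections can still matter a great deal if they are the only link between two groups.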

Geospatial Analysis

  • Examines location-based data to find spatial patterns in everything from disease outbreaks to housing discrimination
  • Uses GIS tools for mapping, spatial statistics, and overlay analysis that reveals geographic relationships
  • Answers "where" questions that spreadsheets alone cannot—why are toxic facilities clustered in certain neighborhoods?
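
Many "where" questions start with distance. The sketch below uses the haversine formula, a standard way to estimate great-circle distance from latitude and longitude, to count how many facilities fall within 5 km of a neighborhood center. The coordinates, facility names, and 5 km radius are assumptions for the example, and a full GIS tool would do far more.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, a common approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    delta_phi = phi2 - phi1
    delta_lambda = math.radians(lon2 - lon1)
    a = (
        math.sin(delta_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(delta_lambda / 2) ** 2
    )
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# Hypothetical neighborhood center and facility locations (lat, lon).
neighborhood = (40.00, -75.00)
facilities = {
    "Plant 1": (40.01, -75.00),
    "Plant 2": (40.00, -75.02),
    "Plant 3": (40.50, -75.00),
}

nearby = [
    name
    for name, (lat, lon) in facilities.items()
    if haversine_km(neighborhood[0], neighborhood[1], lat, lon) <= 5.0
]
print(nearby)
```

Repeating this count for every neighborhood is one simple way to start testing whether facilities cluster in particular areas.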

Compare: Network analysis vs. geospatial analysis—both reveal hidden structures, but network analysis maps relationships (who connects to whom) while geospatial analysis maps locations (where things happen). Some investigations use both—tracking how a rumor spreads through a social network and across geographic regions.


Communication and Responsibility

Your analysis means nothing if you can't communicate it clearly and ethically. These final methods bridge the gap between finding insights and sharing them responsibly.

Data Visualization Techniques

  • Transforms complex findings into visual formats including charts, graphs, maps, and interactive dashboards
  • Enhances comprehension and engagement by making patterns visible that would be invisible in tables of numbers
  • Requires design choices that can either clarify or mislead—axis scales, color choices, and chart types all affect interpretation
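
The effect of axis scale can be put in numbers. This sketch compares how tall two bars are drawn relative to each other when the vertical axis starts at zero versus when it starts just below the smaller value. The figures are hypothetical.

```python
def drawn_height_ratio(smaller, larger, axis_start=0.0):
    """Ratio of the two bars' drawn heights when the axis begins at axis_start."""
    if axis_start >= smaller:
        raise ValueError("axis_start must be below both values")
    return (larger - axis_start) / (smaller - axis_start)


# Hypothetical values: a 4% increase from one year to the next.
last_year, this_year = 50.0, 52.0

from_zero = drawn_height_ratio(last_year, this_year)                   # axis starts at 0
truncated = drawn_height_ratio(last_year, this_year, axis_start=49.0)  # axis starts at 49

print(f"axis from 0:  second bar is {from_zero:.2f}x as tall")
print(f"axis from 49: second bar is {truncated:.2f}x as tall")
```

The data is identical in both cases. Starting the axis at 49 makes a 4% increase look like a tripling, which is why a truncated axis on a bar chart deserves scrutiny.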

Data Ethics and Privacy Considerations

  • Governs responsible data practices from collection through publication, including informed consent and data protection
  • Addresses legal frameworks like GDPR and institutional review requirements that constrain what data you can use
  • Protects sources and subjects while maintaining transparency about your methods—a core tension in data journalism

Compare: Visualization vs. ethics—visualization asks "how do I show this clearly?" while ethics asks "should I show this at all?" A perfectly designed graphic can still be irresponsible if it reveals private information or misleads through selective presentation.


Quick Reference Table

| Concept | Best Examples |
| --- | --- |
| Data preparation | Data cleaning, data scraping |
| Understanding distributions | Descriptive statistics, EDA |
| Finding relationships | Correlation analysis, regression analysis |
| Analyzing change over time | Time series analysis |
| Testing claims | Hypothesis testing, statistical significance/p-values |
| Unstructured data | Text mining/NLP, sentiment analysis |
| Specialized data types | Network analysis, geospatial analysis |
| Communication and responsibility | Data visualization, data ethics |

Self-Check Questions

  1. Which two methods would you use together to investigate whether a relationship you've noticed between variables is statistically meaningful—and in what order?

  2. You've obtained a massive trove of leaked emails. Which methods from this guide would help you analyze them, and what different questions would each method answer?

  3. Compare and contrast correlation analysis and regression analysis. When would you use each, and what's the critical limitation they share?

  4. A city official claims crime has dropped 15% this year. Which methods would help you evaluate whether this claim is meaningful, accounting for both statistical significance and seasonal patterns?

  5. You're building an investigation into which neighborhoods receive the least city investment. Which two specialized analysis methods would be most valuable, and how would you combine their insights?