📈Preparatory Statistics Unit 1 – Intro to Statistics & Data Analysis

Statistics is the science of collecting, analyzing, and interpreting data to make informed decisions. It involves key concepts like populations, samples, parameters, and statistics, as well as methods for organizing and summarizing data through descriptive and inferential techniques. Data can be quantitative or qualitative, with various subtypes. Descriptive statistics summarize data using measures of central tendency, variability, and position. Probability concepts underpin statistical inference, which uses sample data to draw conclusions about populations.

Key Concepts and Terminology

  • Statistics involves collecting, analyzing, interpreting, and presenting data to make informed decisions
  • Population refers to the entire group of individuals, objects, or events of interest
  • Sample is a subset of the population used to draw conclusions about the whole population
  • Parameter is a numerical summary measure that describes a characteristic of a population
  • Statistic is a numerical summary measure computed from sample data used to estimate a population parameter
  • Descriptive statistics involves methods for organizing, summarizing, and presenting data
  • Inferential statistics involves methods for using sample data to make estimates, decisions, predictions, or other generalizations about a population

Types of Data and Variables

  • Variables can be classified as either quantitative (numerical) or qualitative (categorical)
  • Quantitative variables take on numerical values and can be further classified as discrete (countable values) or continuous (measurable values)
    • Examples of discrete variables include the number of siblings or the number of cars owned
    • Examples of continuous variables include height, weight, or temperature
  • Qualitative variables take on non-numerical values that can be classified into categories or groups
    • Nominal variables have categories with no natural ordering (eye color, gender, race)
    • Ordinal variables have categories with a natural ordering (education level, income bracket, survey ratings)

Descriptive Statistics

  • Measures of central tendency describe the center or typical value of a dataset
    • Mean is the arithmetic average of a set of values
    • Median is the middle value when the data is arranged in order (the average of the two middle values when the number of observations is even)
    • Mode is the most frequently occurring value
  • Measures of variability describe the spread or dispersion of a dataset
    • Range is the difference between the maximum and minimum values
    • Variance measures the average squared deviation from the mean (for sample data, the sum of squared deviations is divided by n − 1 rather than n)
    • Standard deviation is the square root of the variance
  • Measures of position describe the relative standing of a value within a dataset
    • Percentiles indicate the percentage of values that fall below a given value
    • Quartiles divide the data into four equal parts (Q1, Q2 or median, Q3)
    • Z-scores measure the number of standard deviations a value is from the mean
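
The measures above can be sketched with Python's standard `statistics` module. The `scores` list and the `z_score` helper are made up for illustration, and the quartile values depend on the interpolation convention (here, the module's default), so other textbooks or tools may report slightly different Q1 and Q3.

```python
import statistics

# Hypothetical data set, chosen only so the arithmetic is easy to check by hand.
scores = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean = statistics.mean(scores)      # 40 / 8 = 5
median = statistics.median(scores)  # average of the two middle values: 4.5
mode = statistics.mode(scores)      # 4 occurs most often

# Measures of variability
data_range = max(scores) - min(scores)          # 9 - 2 = 7
pop_variance = statistics.pvariance(scores)     # divides by n
pop_sd = statistics.pstdev(scores)              # square root of pop_variance
sample_variance = statistics.variance(scores)   # divides by n - 1

# Measures of position
q1, q2, q3 = statistics.quantiles(scores, n=4)  # three quartile cut points


def z_score(x, data):
    """Number of population standard deviations that x lies from the mean."""
    return (x - statistics.mean(data)) / statistics.pstdev(data)


print("mean:", mean, "median:", median, "mode:", mode)
print("range:", data_range, "variance:", pop_variance, "sd:", pop_sd)
print("quartiles:", q1, q2, q3)
print("z-score of 9:", z_score(9, scores))
```

Here the value 9 sits two standard deviations above the mean, which is what its z-score reports.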

Probability Basics

  • Probability is a numerical measure of the likelihood that an event will occur
  • Classical probability is used when outcomes are equally likely and is calculated as the number of favorable outcomes divided by the total number of possible outcomes
  • Empirical probability is based on historical data or observations and is calculated as the relative frequency of an event
  • The complement of an event A, denoted as A', is the event "not A" or everything in the sample space that is not included in A
  • The addition rule for mutually exclusive events states that $P(A \text{ or } B) = P(A) + P(B)$
  • The multiplication rule for independent events states that $P(A \text{ and } B) = P(A) \cdot P(B)$
  • Conditional probability is the probability of an event A occurring given that another event B has already occurred, denoted as $P(A \mid B)$
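
As a rough check of these rules, the sketch below enumerates the 36 equally likely outcomes of rolling two fair dice and computes classical probabilities as exact fractions. The dice scenario and the helper names (`prob`, `cond_prob`, and the event functions) are illustrative assumptions. The conditional probability uses the standard definition P(A | B) = P(A and B) / P(B), which requires P(B) > 0.

```python
from fractions import Fraction
from itertools import product

# Sample space for two fair six-sided dice: 36 equally likely ordered pairs.
sample_space = list(product(range(1, 7), repeat=2))


def prob(event):
    """Classical probability: favorable outcomes / total possible outcomes."""
    favorable = sum(1 for outcome in sample_space if event(outcome))
    return Fraction(favorable, len(sample_space))


def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B); only defined when P(B) > 0."""
    return prob(lambda o: event_a(o) and event_b(o)) / prob(event_b)


def sum_is_7(o):
    return o[0] + o[1] == 7


def sum_is_11(o):
    return o[0] + o[1] == 11


def first_is_6(o):
    return o[0] == 6


def second_is_6(o):
    return o[1] == 6


# Complement: P(not A) = 1 - P(A)
p_not_7 = prob(lambda o: not sum_is_7(o))

# Addition rule: "sum is 7" and "sum is 11" cannot both happen.
p_7_or_11 = prob(lambda o: sum_is_7(o) or sum_is_11(o))

# Multiplication rule: the two dice do not influence each other.
p_double_6 = prob(lambda o: first_is_6(o) and second_is_6(o))

print(p_not_7, p_7_or_11, p_double_6, cond_prob(sum_is_7, first_is_6))
```

Using `Fraction` keeps every result exact, so the addition and multiplication rules can be compared with `==` rather than with a floating-point tolerance.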

Sampling and Data Collection

  • Simple random sampling selects a sample such that every possible sample of the same size has an equal chance of being selected
  • Stratified sampling divides the population into homogeneous subgroups (strata) and then takes a simple random sample from each stratum
  • Cluster sampling divides the population into clusters, randomly selects some of the clusters, and then samples all individuals within the chosen clusters
  • Systematic sampling selects individuals from a list by starting at a random point and then selecting every kth element thereafter
  • Convenience sampling selects individuals who are easily accessible or available, but may not be representative of the population
  • Observational studies observe individuals and measure variables of interest without attempting to influence the responses
  • Experiments deliberately impose some treatment on individuals to observe their responses
    • The independent variable (explanatory variable) is the variable that is manipulated or controlled in an experiment
    • The dependent variable (response variable) is the variable that is measured or observed in an experiment
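
The four probability-based sampling methods above can be sketched with Python's `random` module. Everything specific here is an assumption made for illustration: the population of 100 hypothetical student records, the `year` field used as the stratum, the fixed seed, and the choice to form clusters from consecutive blocks of the list.

```python
import random

rng = random.Random(42)  # fixed seed so repeated runs give the same samples

# Hypothetical population: 100 students, each tagged with a class year.
YEARS = ("first", "second", "third", "fourth")
population = [{"id": i, "year": YEARS[i % 4]} for i in range(100)]


def simple_random_sample(pop, n):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(pop, n)


def stratified_sample(pop, key, n_per_stratum):
    """Group units by `key`, then take a simple random sample in each group."""
    strata = {}
    for unit in pop:
        strata.setdefault(unit[key], []).append(unit)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, n_per_stratum))
    return sample


def systematic_sample(pop, k):
    """Pick a random start among the first k units, then every kth unit."""
    start = rng.randrange(k)
    return pop[start::k]


def cluster_sample(pop, cluster_size, n_clusters):
    """Split into clusters, randomly choose some, keep everyone in them."""
    clusters = [pop[i:i + cluster_size] for i in range(0, len(pop), cluster_size)]
    chosen = rng.sample(clusters, n_clusters)
    return [unit for cluster in chosen for unit in cluster]


print(len(simple_random_sample(population, 10)))
print(len(stratified_sample(population, "year", 5)))
print([u["id"] for u in systematic_sample(population, 10)])
print(len(cluster_sample(population, 10, 3)))
```

Convenience sampling is left out on purpose: it involves no random selection, which is exactly why it may not represent the population.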

Visualizing Data

  • Bar charts display the distribution of a categorical variable using vertical or horizontal bars
  • Pie charts display the relative frequencies of categories as slices of a circle
  • Histograms divide the range of a quantitative variable into intervals (bins) and display the frequency or relative frequency of observations in each interval using vertical bars
  • Stem-and-leaf plots display the individual values of a quantitative variable by splitting each value into a "stem" (leading digit(s)) and a "leaf" (trailing digit)
  • Scatterplots display the relationship between two quantitative variables by plotting ordered pairs (x, y) as points in the coordinate plane
  • Time-series plots display the values of a variable over time, with time on the horizontal axis and the variable of interest on the vertical axis
  • Box plots (box-and-whisker plots) display the distribution of a quantitative variable using five summary statistics (minimum, Q1, median, Q3, maximum)
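
Before drawing any of these displays, it can help to compute the numbers behind them. The sketch below builds the rows of a stem-and-leaf plot, the bin counts of a histogram, and the five-number summary behind a box plot. The `ages` data and all function names are hypothetical, the stem-and-leaf helper assumes non-negative integers, and the quartiles follow the default convention of Python's `statistics.quantiles`, which may differ slightly from a textbook's method.

```python
import statistics
from collections import defaultdict

# Hypothetical ages, used only for illustration.
ages = [12, 15, 21, 23, 23, 28, 31, 34, 35, 47]


def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        rows[stem].append(leaf)
    return {stem: rows[stem] for stem in sorted(rows)}


def histogram_counts(values, bin_width):
    """Count observations in each bin [start, start + bin_width)."""
    counts = defaultdict(int)
    for v in values:
        counts[(v // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Minimum, Q1, median, Q3, maximum: the values a box plot displays."""
    q1, med, q3 = statistics.quantiles(values, n=4)
    return min(values), q1, med, q3, max(values)


for stem, leaves in stem_and_leaf(ages).items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))

for start, count in histogram_counts(ages, 10).items():
    print(f"{start}-{start + 9}: {'*' * count}")

print(five_number_summary(ages))
```

The text output mirrors the plots: each stem row is a sideways bar whose length matches the histogram count for the same interval.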

Statistical Inference

  • Point estimation uses sample statistics to estimate population parameters
    • A point estimator is a formula or method that produces a single value as an estimate of a population parameter
    • An unbiased estimator has an expected value equal to the true value of the parameter being estimated
  • Interval estimation uses sample data to construct an interval of plausible values for a population parameter
    • A confidence interval is a range of values, computed from sample data, produced by a method that captures the true value of the population parameter in a stated percentage of repeated samples (e.g., 95%)
    • The margin of error is the half-width of the confidence interval: the largest difference between the point estimate and the true value of the parameter that is expected at the stated confidence level
  • Hypothesis testing is a procedure for using sample data to test a claim or conjecture about a population parameter
    • The null hypothesis ($H_0$) is a statement of "no effect" or "no difference" that is assumed to be true unless there is strong evidence against it
    • The alternative hypothesis ($H_a$) is a statement that contradicts the null hypothesis and is supported when there is sufficient evidence against $H_0$
    • The significance level ($\alpha$) is the probability of rejecting $H_0$ when it is actually true (Type I error)
    • The p-value is the probability of obtaining a sample statistic at least as extreme as the one observed, assuming that $H_0$ is true
    • If the p-value is less than $\alpha$, we reject $H_0$ and conclude that there is sufficient evidence to support $H_a$; otherwise, we fail to reject $H_0$
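
One way to see these ideas together is a one-sample z procedure for a mean, sketched below with Python's `statistics.NormalDist`. This is only one of several possible procedures, and it assumes the population standard deviation `sigma` is known and the sample mean is approximately normal. The sample values, `sigma = 3`, and the null value `mu_0 = 50` are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical sample and assumed known population standard deviation.
sample = [52, 48, 55, 51, 49, 53, 50, 54, 47]
sigma = 3
mu_0 = 50      # value of the mean claimed by H0
alpha = 0.05   # significance level


def z_confidence_interval(data, sigma, confidence=0.95):
    """Point estimate plus or minus the margin of error z* x sigma / sqrt(n)."""
    x_bar = mean(data)
    z_star = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z_star * sigma / sqrt(len(data))
    return x_bar - margin, x_bar + margin


def z_test_two_sided(data, mu_0, sigma):
    """Test H0: mu = mu_0 against Ha: mu != mu_0; return (z, p-value)."""
    z = (mean(data) - mu_0) / (sigma / sqrt(len(data)))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


low, high = z_confidence_interval(sample, sigma)
z, p_value = z_test_two_sided(sample, mu_0, sigma)

print(f"point estimate: {mean(sample)}")
print(f"95% confidence interval: ({low:.2f}, {high:.2f})")
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < alpha else "fail to reject H0")
```

With these made-up numbers the p-value is well above 0.05, so the sample does not give sufficient evidence against the null hypothesis. The 95% interval also contains 50, which is consistent with that decision.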

Real-World Applications

  • Quality control uses statistical methods to monitor and maintain the quality of products or services
    • Control charts are used to detect unusual variations in a process over time
    • Acceptance sampling involves taking a random sample from a batch of items and deciding whether to accept or reject the entire batch based on the number of defective items in the sample
  • Market research uses surveys, focus groups, and other methods to gather data about consumer preferences, opinions, and behaviors
  • Medical research uses statistical methods to design and analyze clinical trials, epidemiological studies, and other types of health-related research
    • Randomized controlled trials randomly assign participants to treatment and control groups to assess the effectiveness of a drug, therapy, or other intervention
    • Observational studies, such as cohort studies and case-control studies, investigate the relationship between risk factors and health outcomes
  • Social sciences, such as psychology, sociology, and political science, use statistical methods to study human behavior, attitudes, and interactions
    • Surveys and polls are used to gather data from a representative sample of a population
    • Regression analysis is used to model the relationship between variables and make predictions
  • Business analytics uses statistical methods to analyze data and inform decision-making in areas such as finance, marketing, and operations management
    • Time series analysis is used to model and forecast future values of a variable based on its past values
    • A/B testing compares two or more versions of a website, app, or marketing campaign to determine which performs better
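
As an illustration of how the inference ideas carry over to A/B testing, the sketch below compares the conversion rates of two page versions with a two-proportion z-test. This is one common choice, not the only way to analyze such data, and it assumes large, independent random samples. The function name and the visitor and conversion counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Test H0: p_a = p_b against Ha: p_a != p_b; return (z, p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Under H0 both groups share one rate, estimated by pooling the data.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical results: version A converts 100 of 1000 visitors, B 130 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
print("reject H0" if p_value < 0.05 else "fail to reject H0")
```

With these invented counts the p-value falls below 0.05, so at that significance level the difference in conversion rates would be judged statistically significant.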


© 2024 Fiveable Inc. All rights reserved.
AP® and SAT® are trademarks registered by the College Board, which is not affiliated with, and does not endorse this website.
