📈Preparatory Statistics Unit 1 – Intro to Statistics & Data Analysis
Statistics is the science of collecting, analyzing, and interpreting data to make informed decisions. It involves key concepts like populations, samples, parameters, and statistics, as well as methods for organizing and summarizing data through descriptive and inferential techniques.
Data can be quantitative or qualitative, with various subtypes. Descriptive statistics summarize data using measures of central tendency, variability, and position. Probability concepts underpin statistical inference, which uses sample data to draw conclusions about populations.
Statistics involves collecting, analyzing, interpreting, and presenting data to make informed decisions
Population refers to the entire group of individuals, objects, or events of interest
Sample is a subset of the population used to draw conclusions about the whole population
Parameter is a numerical summary measure that describes a characteristic of a population
Statistic is a numerical summary measure computed from sample data used to estimate a population parameter
Descriptive statistics involves methods for organizing, summarizing, and presenting data
Inferential statistics involves methods for using sample data to make estimates, decisions, predictions, or other generalizations about a population
Types of Data and Variables
Variables can be classified as either quantitative (numerical) or qualitative (categorical)
Quantitative variables take on numerical values and can be further classified as discrete (countable values) or continuous (measured values that can fall anywhere within an interval)
Examples of discrete variables include the number of siblings or the number of cars owned
Examples of continuous variables include height, weight, or temperature
Qualitative variables take on non-numerical values that can be classified into categories or groups
Nominal variables have categories with no natural ordering (eye color, gender, race)
Ordinal variables have categories with a natural ordering (education level, income bracket, survey ratings)
Descriptive Statistics
Measures of central tendency describe the center or typical value of a dataset
Mean is the arithmetic average of a set of values
Median is the middle value when the data are arranged in order (with an even number of values, it is the average of the two middle values)
Mode is the most frequently occurring value
Measures of variability describe the spread or dispersion of a dataset
Range is the difference between the maximum and minimum values
Variance measures the average squared deviation from the mean (for a sample, the sum of squared deviations is divided by n − 1 rather than n)
Standard deviation is the square root of the variance
Measures of position describe the relative standing of a value within a dataset
Percentiles indicate the percentage of values that fall below a given value
Quartiles divide the data into four equal parts (Q1, Q2 or median, Q3)
Z-scores measure the number of standard deviations a value is from the mean
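The measures of center, variability, and position listed above can be computed with Python's standard statistics module. This is a minimal sketch: the quiz scores are made-up illustration data, and the quartile values depend on the interpolation convention (the module's default "exclusive" method is used here, which may differ slightly from a textbook's hand method).

```python
import statistics

# Hypothetical sample: quiz scores for ten students (illustration data only).
scores = [4, 7, 7, 8, 9, 10, 10, 10, 12, 13]

# Measures of central tendency
mean = statistics.mean(scores)      # arithmetic average
median = statistics.median(scores)  # average of the two middle values (n is even)
mode = statistics.mode(scores)      # most frequent value

# Measures of variability
data_range = max(scores) - min(scores)
sample_variance = statistics.variance(scores)  # divides by n - 1
sample_sd = statistics.stdev(scores)           # square root of the sample variance

# Measures of position: quartiles split the ordered data into four parts.
q1, q2, q3 = statistics.quantiles(scores, n=4)

# z-score: how many standard deviations each value lies from the mean
z_scores = [(x - mean) / sample_sd for x in scores]

print(f"mean={mean}, median={median}, mode={mode}")
print(f"range={data_range}, variance={sample_variance:.3f}, sd={sample_sd:.3f}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}")
```

Here the second quartile equals the median, and the z-scores sum to zero because deviations from the mean always cancel out.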
Probability Basics
Probability is a numerical measure of the likelihood that an event will occur
Classical probability is used when outcomes are equally likely and is calculated as the number of favorable outcomes divided by the total number of possible outcomes
Empirical probability is based on historical data or observations and is calculated as the relative frequency of an event
The complement of an event A, denoted as A', is the event "not A" or everything in the sample space that is not included in A
The addition rule for mutually exclusive events states that P(A or B)=P(A)+P(B)
The multiplication rule for independent events states that P(A and B)=P(A)⋅P(B)
Conditional probability is the probability of an event A occurring given that another event B has occurred, denoted as P(A∣B) and calculated as P(A∣B)=P(A and B)/P(B) when P(B)>0
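The rules above can be checked on a small sample space. The sketch below assumes one roll of a fair six-sided die, so classical probability applies. The events and the helper name prob are chosen for illustration, and the conditional probability uses the standard definition P(A∣B)=P(A and B)/P(B).

```python
from fractions import Fraction

# Sample space: one roll of a fair six-sided die (equally likely outcomes).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability: favorable outcomes / total possible outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

even = {2, 4, 6}
at_most_two = {1, 2}
five = {5}

# Complement: P(A') = 1 - P(A)
p_not_even = 1 - prob(even)

# Addition rule: "even" and "five" share no outcomes (mutually exclusive).
p_even_or_five = prob(even) + prob(five)

# Conditional probability: P(A | B) = P(A and B) / P(B)
p_even_given_at_most_two = prob(even & at_most_two) / prob(at_most_two)

# Multiplication rule: these two events happen to be independent,
# so P(A and B) equals P(A) * P(B).
p_both = prob(even & at_most_two)
independent = p_both == prob(even) * prob(at_most_two)

print(p_not_even, p_even_or_five, p_even_given_at_most_two, independent)
```

Using Fraction keeps the results exact, so the independence check is a true equality rather than a floating-point comparison.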
Sampling and Data Collection
Simple random sampling selects a sample such that every possible sample of the same size has an equal chance of being selected
Stratified sampling divides the population into homogeneous subgroups (strata) and then takes a simple random sample from each stratum
Cluster sampling divides the population into clusters, randomly selects some of the clusters, and then samples all individuals within the chosen clusters
Systematic sampling selects individuals from a list by starting at a random point and then selecting every kth element thereafter
Convenience sampling selects individuals who are easily accessible or available, but may not be representative of the population
Observational studies observe individuals and measure variables of interest without attempting to influence the responses
Experiments deliberately impose some treatment on individuals to observe their responses
The independent variable (explanatory variable) is the variable that is manipulated or controlled in an experiment
The dependent variable (response variable) is the variable that is measured or observed in an experiment
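The four probability-based sampling methods described in this section can be sketched with Python's random module. Everything concrete here is an assumption made for illustration: a population of 100 student IDs tagged with a class year, the sample sizes, and the choice of consecutive ID blocks as clusters.

```python
import random

rng = random.Random(42)  # fixed seed so reruns give the same samples

# Hypothetical population: 100 student IDs, each tagged with a class year.
years = ["first", "second", "third", "fourth"]
population = [(i, years[i % 4]) for i in range(100)]

# Simple random sampling: every possible sample of size 12 is equally likely.
srs = rng.sample(population, 12)

# Stratified sampling: a simple random sample of 3 from each class year.
stratified = []
for year in years:
    stratum = [person for person in population if person[1] == year]
    stratified.extend(rng.sample(stratum, 3))

# Systematic sampling: random starting point, then every kth element.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Cluster sampling: split into 10 clusters of 10 consecutive IDs,
# randomly choose 2 clusters, and keep every individual in them.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [person for cluster in chosen_clusters for person in cluster]

print(len(srs), len(stratified), len(systematic), len(cluster_sample))
```

Note the contrast between the last two designs: stratified sampling takes a few individuals from every subgroup, while cluster sampling takes every individual from a few subgroups.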
Visualizing Data
Bar charts display the distribution of a categorical variable using vertical or horizontal bars
Pie charts display the relative frequencies of categories as slices of a circle
Histograms divide the range of a quantitative variable into intervals (bins) and display the frequency or relative frequency of observations in each interval using vertical bars
Stem-and-leaf plots display the individual values of a quantitative variable by splitting each value into a "stem" (leading digit(s)) and a "leaf" (trailing digit)
Scatterplots display the relationship between two quantitative variables by plotting ordered pairs (x, y) as points in the coordinate plane
Time-series plots display the values of a variable over time, with time on the horizontal axis and the variable of interest on the vertical axis
Box plots (box-and-whisker plots) display the distribution of a quantitative variable using five summary statistics (minimum, Q1, median, Q3, maximum)
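Two of the displays above can be built without a plotting library. The sketch below prints a text stem-and-leaf plot and computes the five numbers that a box plot draws. The exam scores are made-up illustration data, the stem_and_leaf helper is hypothetical and assumes non-negative whole numbers, and the quartiles follow the statistics module's default convention.

```python
import statistics
from collections import defaultdict

# Hypothetical data: fifteen exam scores (illustration data only).
scores = [52, 58, 61, 64, 64, 67, 70, 73, 73, 75, 78, 81, 86, 89, 94]

def stem_and_leaf(values):
    """Map each stem (leading digits) to its sorted leaves (final digit)."""
    plot = defaultdict(list)
    for value in sorted(values):
        plot[value // 10].append(value % 10)
    return dict(plot)

plot = stem_and_leaf(scores)
for stem in sorted(plot):
    leaves = " ".join(str(leaf) for leaf in plot[stem])
    print(f"{stem} | {leaves}")

# Five-number summary used to draw a box plot
q1, median, q3 = statistics.quantiles(scores, n=4)
five_number_summary = (min(scores), q1, median, q3, max(scores))
print(five_number_summary)
```

Each printed row keeps the individual values visible, which is the main advantage of a stem-and-leaf plot over a histogram for small datasets.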
Statistical Inference
Point estimation uses sample statistics to estimate population parameters
A point estimator is a formula or method that produces a single value as an estimate of a population parameter
An unbiased estimator has an expected value equal to the true value of the parameter being estimated
Interval estimation uses sample data to construct an interval of plausible values for a population parameter
A confidence interval is a range of values, computed from sample data, that is likely to contain the true value of a population parameter at a stated level of confidence (e.g., 95%)
The margin of error is half the width of a confidence interval: the largest difference expected, at the stated confidence level, between the point estimate and the true value of the parameter
Hypothesis testing is a procedure for using sample data to test a claim or conjecture about a population parameter
The null hypothesis (H0) is a statement of "no effect" or "no difference" that is assumed to be true unless there is strong evidence against it
The alternative hypothesis (Ha) is a statement that contradicts the null hypothesis and is accepted if there is sufficient evidence against H0
The significance level (α) is the probability of rejecting H0 when it is actually true (Type I error)
The p-value is the probability of obtaining a sample statistic at least as extreme as the one observed, assuming that H0 is true
If the p-value is less than α, we reject H0 and conclude that there is sufficient evidence to support Ha; otherwise, we fail to reject H0
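A worked example can tie the interval estimate and the hypothesis test together. The sketch below assumes a one-sample z-test with a known population standard deviation, which is only one of several possible procedures. The scenario (a claimed mean fill of 500 ml, a sample of 25 bottles) and all numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# Hypothetical scenario: H0 says the mean fill is 500 ml; Ha says it is not.
mu0 = 500.0          # value claimed under H0
sigma = 10.0         # population standard deviation (assumed known)
n = 25               # sample size
sample_mean = 496.0  # point estimate of the population mean
alpha = 0.05         # significance level

std_error = sigma / sqrt(n)
z = (sample_mean - mu0) / std_error  # test statistic

# Two-sided p-value: probability of a z at least this extreme under H0
std_normal = NormalDist()
p_value = 2 * (1 - std_normal.cdf(abs(z)))

# 95% confidence interval: point estimate plus or minus the margin of error
z_critical = std_normal.inv_cdf(1 - alpha / 2)
margin_of_error = z_critical * std_error
confidence_interval = (sample_mean - margin_of_error,
                       sample_mean + margin_of_error)

reject_h0 = p_value < alpha

print(f"z = {z:.2f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
print(f"95% CI: ({confidence_interval[0]:.2f}, {confidence_interval[1]:.2f})")
```

With these numbers the two approaches agree: the p-value falls below α and the 95% confidence interval does not contain the claimed value of 500.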
Real-World Applications
Quality control uses statistical methods to monitor and maintain the quality of products or services
Control charts are used to detect unusual variations in a process over time
Acceptance sampling involves taking a random sample from a batch of items and deciding whether to accept or reject the entire batch based on the number of defective items in the sample
Market research uses surveys, focus groups, and other methods to gather data about consumer preferences, opinions, and behaviors
Medical research uses statistical methods to design and analyze clinical trials, epidemiological studies, and other types of health-related research
Randomized controlled trials randomly assign participants to treatment and control groups to assess the effectiveness of a drug, therapy, or other intervention
Observational studies, such as cohort studies and case-control studies, investigate the relationship between risk factors and health outcomes
Social sciences, such as psychology, sociology, and political science, use statistical methods to study human behavior, attitudes, and interactions
Surveys and polls are used to gather data from a representative sample of a population
Regression analysis is used to model the relationship between variables and make predictions
Business analytics uses statistical methods to analyze data and inform decision-making in areas such as finance, marketing, and operations management
Time series analysis is used to model and forecast future values of a variable based on its past values
A/B testing compares two or more versions of a website, app, or marketing campaign to determine which performs better
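As one illustration of how an A/B test might be analyzed, the sketch below applies a pooled two-proportion z-test to conversion counts from two page versions. The test choice, the function name two_proportion_z_test, and the visitor counts are all assumptions made for illustration; real analyses may use other methods.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test of H0: both versions have the same conversion rate."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion: the best estimate of the common rate if H0 is true
    pooled = (success_a + success_b) / (n_a + n_b)
    std_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical results: version A converts 100 of 1000 visitors,
# version B converts 130 of 1000 visitors.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

At a significance level of 0.05, these hypothetical counts would lead to rejecting the hypothesis of equal conversion rates, since the p-value is below 0.05.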