Statistics
Concept: Descriptive Statistics, Distributions, and Inference
Table of Contents
- Descriptive Statistics
- Probability Distributions
- Sampling Distributions
- Estimation
- Hypothesis Testing
- Regression
- Analysis of Variance
- Nonparametric Methods
Descriptive Statistics
Measures of Center
Mean (average): $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Median: Middle value (50th percentile). If $n$ is odd: the middle value; if $n$ is even: the average of the two middle values.
Mode: Most frequent value
Trimmed mean: Mean computed after discarding a fixed percentage of the smallest and largest values
Measures of Spread
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$
Range: $\max - \min$
Interquartile Range (IQR): $Q_3 - Q_1$
Outlier rule: Values outside $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$
Box Plot
- Whiskers extend to the most extreme data points within $1.5\,\mathrm{IQR}$ of the box edges
- Box from $Q_1$ to $Q_3$
- Median line inside box
Standardized Score (z-score): $z = \frac{x - \bar{x}}{s}$
Chebyshev’s Inequality: At least $1 - 1/k^2$ of the data lies within $k$ standard deviations of the mean.
Empirical Rule (Normal): 68-95-99.7% of the data within 1, 2, 3 standard deviations of the mean.
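The measures above can be computed with Python's standard library; a minimal sketch on a small made-up sample (the `data` values are hypothetical):

```python
import statistics as st

data = [4, 8, 15, 16, 23, 42, 108]  # hypothetical sample with one large value

mean = st.mean(data)        # arithmetic mean
median = st.median(data)    # 50th percentile
s = st.stdev(data)          # sample standard deviation (n-1 denominator)

# Quartiles via the "inclusive" method (a common textbook convention)
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# 1.5*IQR outlier fences
lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < lo or x > hi]

# z-score of the largest observation
z_max = (max(data) - mean) / s
print(mean, median, iqr, outliers)
```

Note that `statistics.stdev` uses the $n-1$ denominator (sample standard deviation), matching the formula above; `pstdev` would give the population version.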
Probability Distributions
Discrete Distributions
Uniform (discrete): $P(X = x) = \frac{1}{n}$ for $x = 1, \dots, n$
Bernoulli: $P(X = 1) = p$, $P(X = 0) = 1 - p$; mean $p$, variance $p(1-p)$
Binomial: $P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}$, $k = 0, \dots, n$
Geometric: Number of trials until first success: $P(X = k) = (1-p)^{k-1} p$
Negative Binomial: Number of trials until $r$ successes: $P(X = k) = \binom{k-1}{r-1} p^r (1-p)^{k-r}$
Hypergeometric: Sampling without replacement ($N$ total objects, $K$ successes, $n$ draws): $P(X = k) = \frac{\binom{K}{k}\binom{N-K}{n-k}}{\binom{N}{n}}$
Poisson: parameter $\lambda$: $P(X = k) = \frac{\lambda^k e^{-\lambda}}{k!}$
Continuous Distributions
Uniform: $f(x) = \frac{1}{b-a}$ for $a \le x \le b$
Exponential: $f(x) = \lambda e^{-\lambda x}$ for $x \ge 0$. Memoryless property: $P(X > s + t \mid X > s) = P(X > t)$
Normal: $f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(x-\mu)^2 / (2\sigma^2)}$. Standard normal: $\mu = 0$, $\sigma = 1$
Standardizing: $Z = \frac{X - \mu}{\sigma}$
Gamma: $f(x) = \frac{\lambda^{\alpha}}{\Gamma(\alpha)} x^{\alpha - 1} e^{-\lambda x}$ for $x > 0$
Chi-square ($\chi^2_\nu$): Gamma with $\alpha = \nu/2$, $\lambda = 1/2$; degrees-of-freedom parameter $\nu$
t-distribution: Heavy-tailed, parameter $\nu$ (degrees of freedom). Approaches the normal as $\nu \to \infty$
F-distribution: $F_{\nu_1, \nu_2}$ (two degrees-of-freedom parameters); the distribution of a ratio of scaled chi-squares
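The normal cdf has no closed form but can be expressed with the error function, which also lets us verify the empirical rule and the exponential's memorylessness numerically; a minimal sketch (the $\lambda$, $s$, $t$ values are arbitrary):

```python
from math import exp, sqrt, pi, erf

def normal_pdf(x, mu=0.0, sigma=1.0):
    # density of N(mu, sigma^2)
    return exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def normal_cdf(x, mu=0.0, sigma=1.0):
    # Phi via the error function: Phi(z) = (1 + erf(z / sqrt(2))) / 2
    z = (x - mu) / sigma
    return 0.5 * (1 + erf(z / sqrt(2)))

def expon_sf(x, lam):
    # survival function P(X > x) for Exponential(lam)
    return exp(-lam * x)

# empirical rule: about 95% of the mass within 2 standard deviations
within2 = normal_cdf(2) - normal_cdf(-2)

# memoryless check: P(X > s+t | X > s) equals P(X > t)
lam, s, t = 0.5, 3.0, 2.0
assert abs(expon_sf(s + t, lam) / expon_sf(s, lam) - expon_sf(t, lam)) < 1e-12
print(round(within2, 4))
```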
Central Limit Theorem
If $X_1, \dots, X_n$ i.i.d. with mean $\mu$ and variance $\sigma^2$:
$\frac{\bar{X} - \mu}{\sigma / \sqrt{n}} \to N(0, 1)$ in distribution as $n \to \infty$
Law of Large Numbers
Sample mean converges to population mean: $\bar{X}_n \to \mu$
Almost surely (strong law).
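Both limit theorems are easy to see by simulation; a sketch with the standard library (seed and sample sizes are arbitrary choices):

```python
import random
import statistics as st

random.seed(0)

# Law of Large Numbers: proportion of heads approaches 0.5
flips = [random.random() < 0.5 for _ in range(20000)]
prop = sum(flips) / len(flips)

# Central Limit Theorem: means of n die rolls concentrate around 3.5
n, reps = 30, 2000
means = [st.mean(random.randint(1, 6) for _ in range(n)) for _ in range(reps)]
# theory: E[roll] = 3.5, SD of the mean = sqrt(35/12) / sqrt(n)
print(round(prop, 3), round(st.mean(means), 3))
```

A histogram of `means` would look approximately bell-shaped even though each individual roll is uniform.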
Sampling Distributions
Sample Mean
If $X_1, \dots, X_n$ i.i.d.: $E[\bar{X}] = \mu$, $\mathrm{Var}(\bar{X}) = \sigma^2 / n$
Standardized: $Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
Sample Variance
$S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$
Chi-square Distribution
If $X_1, \dots, X_n$ i.i.d. normal: $\frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1}$
t-distribution
$T = \frac{\bar{X} - \mu}{S / \sqrt{n}} \sim t_{n-1}$
Comparison: Student’s t versus normal (use t when $\sigma$ is unknown).
Two Samples
Difference of means: $\bar{X} - \bar{Y} \sim N\left(\mu_1 - \mu_2,\ \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\right)$
Pooled variance: If $\sigma_1 = \sigma_2$: $T = \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{S_p \sqrt{1/m + 1/n}} \sim t_{m+n-2}$
where $S_p^2 = \frac{(m-1)S_1^2 + (n-1)S_2^2}{m + n - 2}$
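The pooled statistic can be computed by hand in a few lines; a sketch on two small hypothetical samples:

```python
import statistics as st
from math import sqrt

# hypothetical measurements from two groups
x = [5.1, 4.9, 5.6, 5.2, 5.0, 5.3]
y = [4.6, 4.8, 4.5, 5.0, 4.7]

m, n = len(x), len(y)
s1_sq, s2_sq = st.variance(x), st.variance(y)  # sample variances (n-1 denominator)

# pooled variance: df-weighted average of the two sample variances
sp_sq = ((m - 1) * s1_sq + (n - 1) * s2_sq) / (m + n - 2)

# pooled t statistic for H0: mu1 == mu2
t = (st.mean(x) - st.mean(y)) / sqrt(sp_sq * (1 / m + 1 / n))
print(round(t, 3))  # compare against the t distribution with m+n-2 df
```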
Estimation
Point Estimation
Estimator: Function of sample (random variable)
Estimate: Realization for specific sample
Unbiased: $E[\hat{\theta}] = \theta$
Consistent: $\hat{\theta} \to \theta$ in probability as $n \to \infty$
Efficient: Minimum variance among unbiased estimators
Maximum Likelihood Estimation
Likelihood: $L(\theta) = \prod_{i=1}^{n} f(x_i; \theta)$
MLE: $\hat{\theta}$ maximizes $L(\theta)$, or equivalently $\ell(\theta) = \log L(\theta)$
Invariance: If $\hat{\theta}$ is the MLE for $\theta$, then $g(\hat{\theta})$ is the MLE for $g(\theta)$
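As a concrete illustration, for $\mathrm{Exponential}(\lambda)$ the log-likelihood is $\ell(\lambda) = n\log\lambda - \lambda\sum x_i$, and setting $\ell'(\lambda)=0$ gives the closed form $\hat{\lambda} = 1/\bar{x}$. A sketch that confirms this numerically with a crude grid search (the data are made up):

```python
import statistics as st
from math import log

data = [0.8, 1.3, 0.2, 2.1, 0.9, 1.7, 0.4]  # hypothetical exponential sample
n, total = len(data), sum(data)

def log_lik(lam):
    # log L(lam) = n log(lam) - lam * sum(x_i)
    return n * log(lam) - lam * total

# closed-form MLE from the first-order condition
lam_closed = 1 / st.mean(data)

# numeric check: grid-search the log-likelihood on (0, 5)
grid = [i / 1000 for i in range(1, 5000)]
lam_grid = max(grid, key=log_lik)
print(round(lam_closed, 3), lam_grid)
```

The grid maximizer agrees with $1/\bar{x}$ to the grid resolution, as the concave log-likelihood has a single peak.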
Method of Moments
Moment estimator: Equate sample moments to population moments
Confidence Intervals
Interpretation: In repeated sampling, $(1 - \alpha) \cdot 100\%$ of intervals constructed this way contain the true parameter
Normal, known variance: $\bar{x} \pm z_{\alpha/2} \frac{\sigma}{\sqrt{n}}$
Normal, unknown variance: $\bar{x} \pm t_{\alpha/2,\, n-1} \frac{s}{\sqrt{n}}$
Proportion: $\hat{p} \pm z_{\alpha/2} \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
(For large $n$)
Sample size needed: $n = \left(\frac{z_{\alpha/2}\, \sigma}{E}\right)^2$
where $E$ is the desired margin of error.
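The three formulas above in code, using the familiar $z_{0.025} \approx 1.96$; a sketch with hypothetical numbers:

```python
from math import sqrt, ceil

z = 1.96  # z_{alpha/2} for a 95% interval

# mean with known sigma (hypothetical: xbar = 72.5, sigma = 8, n = 64)
xbar, sigma, n = 72.5, 8.0, 64
half = z * sigma / sqrt(n)
ci_mean = (xbar - half, xbar + half)

# proportion, large-sample normal approximation (hypothetical: 210/500 successes)
p_hat, m = 0.42, 500
half_p = z * sqrt(p_hat * (1 - p_hat) / m)
ci_prop = (p_hat - half_p, p_hat + half_p)

# sample size for a desired margin of error E, rounding up
E = 1.0
n_needed = ceil((z * sigma / E) ** 2)
print(ci_mean, ci_prop, n_needed)
```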
Bootstrapping
Resampling method: Sample with replacement from data
Use empirical distribution as approximate population distribution.
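A percentile bootstrap interval for the median can be sketched in a few lines; the data, seed, and replicate count here are arbitrary choices:

```python
import random
import statistics as st

random.seed(1)
data = [12, 15, 9, 22, 17, 14, 30, 11, 19, 16]  # hypothetical sample

# resample with replacement, recomputing the statistic each time
boot_medians = sorted(
    st.median(random.choices(data, k=len(data))) for _ in range(5000)
)

# percentile bootstrap 95% CI: 2.5th and 97.5th percentiles of the replicates
lo = boot_medians[int(0.025 * 5000)]
hi = boot_medians[int(0.975 * 5000)]
print(st.median(data), (lo, hi))
```

The same recipe works for any statistic (mean, trimmed mean, correlation) by swapping out `st.median`.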
Hypothesis Testing
Hypotheses
$H_0$: Null hypothesis (status quo); $H_a$: Alternative hypothesis
Type I error: Reject $H_0$ when $H_0$ is true (probability = significance level $\alpha$)
Type II error: Fail to reject $H_0$ when $H_0$ is false (probability $\beta$). Power: $1 - \beta$
Test Statistic
One-sample z-test (normal, $\sigma$ known): $Z = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}}$
One-sample t-test (normal, $\sigma$ unknown): $T = \frac{\bar{X} - \mu_0}{S / \sqrt{n}}$
Two-sample t-test: $T = \frac{\bar{X} - \bar{Y}}{S_p \sqrt{1/m + 1/n}}$
p-value
Probability of observing a statistic at least as extreme as the one observed, assuming $H_0$ is true
Decision: Reject $H_0$ if $p \le \alpha$
Interpretation: A smaller p-value is stronger evidence against $H_0$
Rejection Regions
Two-tailed: Reject if |test statistic| > critical value
One-tailed: Reject if test statistic > critical value (upper) or < -critical value (lower)
t-tests
One-sample, two-sided: $H_0: \mu = \mu_0$ vs $H_a: \mu \ne \mu_0$
Reject if $|T| \ge t_{\alpha/2,\, n-1}$
One-sample, upper: $H_0: \mu \le \mu_0$ vs $H_a: \mu > \mu_0$
Reject if $T \ge t_{\alpha,\, n-1}$
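A two-sided one-sample t-test computed by hand; a sketch on hypothetical data, with the critical value $t_{0.025,\,7} = 2.365$ hard-coded:

```python
import statistics as st
from math import sqrt

# H0: mu = 50 vs Ha: mu != 50, alpha = 0.05
data = [52.1, 49.8, 53.4, 51.2, 50.9, 52.7, 48.9, 51.8]  # hypothetical
n = len(data)

# t statistic: (xbar - mu0) / (s / sqrt(n))
t = (st.mean(data) - 50) / (st.stdev(data) / sqrt(n))

t_crit = 2.365  # t_{0.025, n-1} for n-1 = 7 degrees of freedom
reject = abs(t) > t_crit
print(round(t, 3), reject)
```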
Tests for Proportions
One-sample: $Z = \frac{\hat{p} - p_0}{\sqrt{p_0(1 - p_0)/n}}$
Two-sample: $Z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1 - \hat{p})(1/m + 1/n)}}$
where $\hat{p} = \frac{m\hat{p}_1 + n\hat{p}_2}{m + n}$ is the pooled proportion
Chi-square Tests
Goodness of fit: $\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \sim \chi^2_{k-1-m}$
where $O_i$ = observed counts, $E_i$ = expected counts, $m$ = number of parameters estimated.
Test of independence (contingency table): $\chi^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} \sim \chi^2_{(r-1)(c-1)}$
where $E_{ij} = \frac{(\text{row}_i\ \text{total})(\text{column}_j\ \text{total})}{n}$.
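The independence test on a 2x2 table can be computed directly from the formula; a sketch with hypothetical counts and the df = 1 critical value hard-coded:

```python
# hypothetical 2x2 contingency table: group vs outcome
table = [[30, 20],   # group A: success, failure
         [18, 32]]   # group B: success, failure

row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]
n = sum(row)

# chi^2 = sum (O - E)^2 / E with E_ij = row_i * col_j / n
chi2 = sum(
    (table[i][j] - row[i] * col[j] / n) ** 2 / (row[i] * col[j] / n)
    for i in range(2) for j in range(2)
)
df = (2 - 1) * (2 - 1)
crit = 3.841  # chi-square critical value for df = 1, alpha = 0.05
print(round(chi2, 3), chi2 > crit)
```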
Regression
Simple Linear Regression
Model: $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$, with $\epsilon_i \sim N(0, \sigma^2)$ independent
Least squares estimates: $\hat{\beta}_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}$, $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$
Inference on Slope
t-test: $H_0: \beta_1 = 0$ vs $H_a: \beta_1 \ne 0$: $T = \frac{\hat{\beta}_1}{\mathrm{SE}(\hat{\beta}_1)} \sim t_{n-2}$
where $\mathrm{SE}(\hat{\beta}_1) = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}}$ and $s$ is the residual standard error
Confidence interval: $\hat{\beta}_1 \pm t_{\alpha/2,\, n-2}\, \mathrm{SE}(\hat{\beta}_1)$
Confidence and Prediction Intervals
For the mean response at $x^*$: $\hat{y} \pm t_{\alpha/2,\, n-2}\, s \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
For a new observation at $x^*$: $\hat{y} \pm t_{\alpha/2,\, n-2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$
Multiple Regression
$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p + \epsilon$
Matrix form: $\mathbf{y} = X\boldsymbol{\beta} + \boldsymbol{\epsilon}$
Least squares: $\hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$
Inference: F-test for model, t-tests for individual coefficients
R²: Proportion of variance explained
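The simple-regression estimates and R² follow directly from the formulas above; a sketch on a hypothetical near-linear data set:

```python
import statistics as st

# hypothetical (x, y) pairs lying close to a line
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

xbar, ybar = st.mean(x), st.mean(y)
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b1 = sxy / sxx            # slope: Sxy / Sxx
b0 = ybar - b1 * xbar     # intercept: ybar - b1 * xbar

# R^2 = 1 - SSE/SST (proportion of variance explained)
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
sse = sum(e * e for e in resid)
sst = sum((yi - ybar) ** 2 for yi in y)
r2 = 1 - sse / sst
print(round(b1, 3), round(b0, 3), round(r2, 4))
```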
Analysis of Variance
One-Way ANOVA
Model: $X_{ij} = \mu_i + \epsilon_{ij}$, $\epsilon_{ij} \sim N(0, \sigma^2)$ ($i = 1, \dots, k$ treatments; $j = 1, \dots, n_i$)
Hypothesis: $H_0: \mu_1 = \mu_2 = \dots = \mu_k$ vs $H_a$: at least two means differ
F-test: $F = \frac{\mathrm{MSTr}}{\mathrm{MSE}} \sim F_{k-1,\, n-k}$ under $H_0$
where: $\mathrm{MSTr} = \frac{\mathrm{SSTr}}{k-1}$, $\mathrm{MSE} = \frac{\mathrm{SSE}}{n-k}$, $\mathrm{SSTr} = \sum_i n_i (\bar{X}_{i\cdot} - \bar{X}_{\cdot\cdot})^2$, $\mathrm{SSE} = \sum_{i,j} (X_{ij} - \bar{X}_{i\cdot})^2$
Test statistic: Reject $H_0$ if $F \ge F_{\alpha,\, k-1,\, n-k}$
ANOVA Table:
| Source | df | SS | MS | F |
|---|---|---|---|---|
| Treatments | k-1 | SSTr | MSTr | MSTr/MSE |
| Error | n-k | SSE | MSE | |
| Total | n-1 | SST | | |
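The table's entries can be computed directly from the sums of squares; a sketch on three hypothetical treatment groups:

```python
import statistics as st

# hypothetical responses under three treatments
groups = [
    [18.2, 20.1, 17.6, 16.8, 18.8],
    [24.6, 23.1, 22.0, 23.9, 24.8],
    [19.9, 21.2, 20.5, 20.2, 21.1],
]
k = len(groups)
n = sum(len(g) for g in groups)
grand = st.mean(x for g in groups for x in g)

# between-group (treatments) and within-group (error) sums of squares
sstr = sum(len(g) * (st.mean(g) - grand) ** 2 for g in groups)
sse = sum(sum((x - st.mean(g)) ** 2 for x in g) for g in groups)

mstr, mse = sstr / (k - 1), sse / (n - k)
F = mstr / mse
print(round(F, 2))  # compare against F_{k-1, n-k} critical value
```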
Two-Way ANOVA
Model: $X_{ijk} = \mu + \alpha_i + \beta_j + \gamma_{ij} + \epsilon_{ijk}$
Tests for main effects and interaction.
Nonparametric Methods
Sign Test
For the median $\tilde{\mu}$: $H_0: \tilde{\mu} = \tilde{\mu}_0$
Test using the signs of $X_i - \tilde{\mu}_0$; under $H_0$ the number of positive signs is Binomial$(n, 1/2)$
Wilcoxon Signed-Rank Test
For paired differences (nonparametric alternative to paired t-test)
Use ranks of |differences|, account for signs.
Wilcoxon Rank-Sum Test
For two independent samples (nonparametric alternative to two-sample t-test)
Mann-Whitney: Sum ranks from one sample
Kruskal-Wallis Test
Multi-sample nonparametric test
Uses ranks, alternative to one-way ANOVA
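The sign test is the simplest of these to implement, since its null distribution is exactly Binomial$(n, 1/2)$; a sketch on hypothetical data testing $H_0: \tilde{\mu} = 10$ against an upper alternative:

```python
from math import comb

# hypothetical observations; ties with the null median are dropped
data = [12.1, 9.4, 13.2, 11.8, 10.6, 14.0, 9.9, 12.5, 11.1, 13.7]
plus = sum(x > 10 for x in data)
n = sum(x != 10 for x in data)

# under H0 the count of + signs is Binomial(n, 1/2);
# upper-tail p-value: P(X >= plus)
p_value = sum(comb(n, k) for k in range(plus, n + 1)) / 2**n
print(plus, round(p_value, 4))
```

The rank-based tests (Wilcoxon, Kruskal-Wallis) follow the same pattern but replace raw signs with ranks, which makes their exact null distributions more work to tabulate.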
Supplementary Statistics & Probability Reference (from mathematics_GPT)
Unique Example Summaries
- Chebyshev’s Inequality: If a data set has a mean of 100 and a standard deviation of 20, at least 75% of the data falls within 60 to 140.
- Empirical Rule (Normal): For a normal distribution, 68% of the data falls within one standard deviation of the mean, 95% within two, and 99.7% within three.
- Law of Large Numbers: If you flip a fair coin repeatedly, the proportion of heads will converge to 0.5 as the number of flips increases.
- Central Limit Theorem: If you repeatedly roll a fair six-sided die, the average of the results will be approximately normally distributed once the number of rolls is large.
- Chi-square Distribution: The sum of the squares of $\nu$ independent standard normal random variables is chi-square distributed with $\nu$ degrees of freedom.
- t-distribution: If you take a sample from a normal distribution, the ratio $\frac{\bar{X} - \mu}{S / \sqrt{n}}$ (the centered sample mean divided by the sample standard deviation over the square root of the sample size) follows a t-distribution with $n - 1$ degrees of freedom.
Mnemonic Tables
- Probability Distributions:
- Uniform: All outcomes equally likely.
- Bernoulli: Binary outcome (success/failure).
- Binomial: Multiple independent Bernoulli trials.
- Geometric: Number of trials until first success.
- Negative Binomial: Number of trials until r successes.
- Hypergeometric: Sampling without replacement.
- Poisson: Counts of rare events.
- Continuous Distributions:
- Uniform: Flat line.
- Exponential: Decaying curve.
- Normal: Bell curve.
- Gamma: Right-skewed.
- Chi-square: Right-skewed.
- t-distribution: Heavy-tailed.
- F-distribution: Right-skewed; ratio of two scaled chi-squares.
- Hypothesis Testing:
- One-sample z-test: Normal, σ known.
- One-sample t-test: Normal, σ unknown.
- Two-sample t-test: Two samples, pooled variance.
- Chi-square tests: Goodness of fit, independence.
- Confidence Intervals:
- Normal, known variance: z-score.
- Normal, unknown variance: t-distribution.
- Proportion: Normal approximation.
- Regression:
- Simple linear: Least squares, t-test.
- Multiple: Matrix form, F-test.
Practical Rules
- Sampling:
- When the sample is a small fraction of the population, sampling with and without replacement behave almost identically.
- When the sample is a sizable fraction of the population, account for sampling without replacement (e.g., the hypergeometric distribution).
- Hypothesis Testing:
- If the sample size is large (n ≥ 30), z-tests can be used.
- If the sample size is small (n < 30), t-tests are preferred.
- If σ is unknown, use t-tests.
- Confidence Intervals:
- For large n, z-intervals are accurate.
- For small n, t-intervals are more robust.
- For proportions, use normal approximation for large n.
- Regression:
- R² measures the proportion of variance explained.
- A high R² does not imply causality.
- Always check residuals for normality and independence.
Next: Probability Theory
Last updated: Comprehensive statistics reference covering descriptive, inferential, and regression methods.