Descriptive vs. Inferential Statistics

Descriptive Statistics
- What it studies: Things like average, minimum, maximum, variance, and standard deviation
- Purpose: To summarize and describe the data you already have
Inferential Statistics
- What it studies: Hypothesis testing, confidence intervals, regression
- Purpose: To make guesses or predictions about a bigger group (population) using a smaller group (sample)
Measures of Central Tendency: Mean, Median, Mode


On the left side of the image, we see the monthly income of six people. The average (mean) income is $6,250. Based on this, someone might decide it’s a good area to open a luxury store.
But on the right side, one more person is added: Elon Musk, who earns $10 million a month. The new average income jumps to about $1.43 million.
That number doesn’t reflect what most people in the area actually earn. Musk’s income is an outlier, and outliers can pull the average way up or down. So using just the average (mean) isn’t always a good way to understand a group, especially when one value is much higher or lower than the rest.
This is also why, when imputing missing values (for example with the mean), we take care of outliers first; otherwise the imputed values inherit the distortion.
Range, IQR
The range is the difference between the maximum and minimum. The IQR (interquartile range) is the difference between Q3 (75th percentile) and Q1 (25th percentile). Because it only considers the middle 50% of the values, it is robust against outliers. It shows the data spread, and anything above the upper bound (Q3 + 1.5 × IQR) or below the lower bound (Q1 − 1.5 × IQR) is considered an outlier.
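As a sketch with made-up income values, the 1.5 × IQR rule can be applied with pandas:

```python
import pandas as pd

# Hypothetical values; 250 is an obvious outlier
data = pd.Series([54, 55, 60, 61, 62, 66, 67, 71, 250])

q1 = data.quantile(0.25)          # 25th percentile (Q1)
q3 = data.quantile(0.75)          # 75th percentile (Q3)
iqr = q3 - q1                     # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers.tolist())  # only 250 falls outside the bounds
```

Note that the bounds are computed from the middle of the distribution, so the extreme value 250 cannot drag them the way it would drag the mean.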



Measures of Dispersion: Variance and Standard Deviation
Both variance and standard deviation (SD) tell us how spread out the numbers are in a data set like daily stock prices.
- Variance shows how much the numbers differ from the average (mean).
- Standard deviation is just the square root of the variance. It tells us the same thing, but in the same unit as the data (like dollars).
Example: Tesla vs Coca-Cola
Tesla (High Variance and SD)
Tesla’s stock might go from:
- $680 → $700 → $630 → $710 → $650
These prices jump up and down a lot, even if the average is around $674. This means:
- High variance
- High standard deviation
Investors call Tesla a volatile or risky stock because it changes value a lot.
Coca-Cola (Low Variance and SD)
Coca-Cola’s stock might go:
- $60 → $61 → $59 → $60 → $61
These prices are very close to each other and to the average (about $60). This means:
- Low variance
- Low standard deviation
Coca-Cola is seen as a stable stock with low risk.



The left table (orange) and right table (green) show two different distributions of yearly income, along with dot plots visualizing them.
Left Table (Orange) – Low Variance & Low Standard Deviation
- The yearly incomes are:
71, 62, 66, 61, 54, 67, 55, 60
- These values cluster closely around the mean (≈ 62).
- Low spread = Low variance and low standard deviation
- This means the data is consistent and stable, with small deviations from the mean.
Right Table (Green) – High Variance & High Standard Deviation
- The yearly incomes are:
99, 14, 75, 84, 44, 54, 98, 28
- The values are widely spread out — ranging from 14 to 99.
- High spread = High variance and high standard deviation
- This means the data is less consistent, with large deviations from the average.
📊 Summary:
Group | Mean | Spread of Data | Variance & SD |
Left (Orange) | ≈ 62 | Tight, consistent | Low |
Right (Green) | ≈ 62 | Wide, inconsistent | High |
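The two tables above can be reproduced with NumPy; both groups share the same mean but differ sharply in spread:

```python
import numpy as np

left = np.array([71, 62, 66, 61, 54, 67, 55, 60])    # orange table
right = np.array([99, 14, 75, 84, 44, 54, 98, 28])   # green table

# Same mean, very different dispersion
print(left.mean(), right.mean())        # both 62.0
print(left.var(), left.std())           # low variance / low SD
print(right.var(), right.std())         # high variance / high SD
```

This is why the mean alone is not enough to describe a dataset: you need a measure of dispersion alongside it.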
Skewness

Data can be normally distributed, or it can be skewed to the left or right.
- When data is left-skewed (negative skew), most values are high, but a few low numbers pull the "tail" to the left. This can affect how your model learns from the data.
- When data is right-skewed (positive skew), most values are low, but a few very high numbers stretch the "tail" to the right. This also shifts the data away from the average.
How Skewness Affects Modeling:
Impact | Explanation |
1. Biased predictions | Models may learn patterns that are more influenced by the skewed tail. |
2. Poor accuracy | Linear models (like regression) assume normality—skew breaks that assumption. |
3. Wrong feature importance | Outliers in skewed data can make a feature seem more or less important. |
To fix skewed data, we often:
- Look for outliers (extremely high or low values) and treat them so the data becomes more balanced or normally distributed, which helps improve model accuracy.
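Besides treating outliers, a log transform is a common way to reduce right skew. A sketch on simulated (assumed, not real) income data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
incomes = rng.lognormal(mean=10, sigma=1, size=5000)  # strongly right-skewed

before = skew(incomes)
after = skew(np.log1p(incomes))  # log(1 + x) pulls the long right tail in

print(f"skew before: {before:.2f}, after: {after:.2f}")
```

After the transform the distribution is close to symmetric, which is friendlier to models that assume normality.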

Normal distribution: data is balanced around the mean.
Quick Recap: Empirical Rule (Normal Distribution)
- 68% of the data lies within ±1 standard deviation of the mean.
- 95% lies within ±2 standard deviations.
- 99.7% lies within ±3 standard deviations.
Anything lying outside ±3 standard deviations is considered an outlier and should be dealt with.

Take the data as a pd.Series and find its mean and standard deviation. Then calculate the mean ± 1, 2, and 3 standard deviations, and filter the data to see which points fall within each band.
Anything above mean + 3 × std or below mean − 3 × std is considered an outlier.
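A sketch of that procedure on a simulated transaction column (the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
transactions = pd.Series(rng.normal(loc=100, scale=10, size=1000))

mean, std = transactions.mean(), transactions.std()

# Bands at 1, 2, and 3 standard deviations around the mean
for k in (1, 2, 3):
    within = transactions.between(mean - k * std, mean + k * std)
    print(f"within ±{k} SD: {within.mean():.1%}")  # roughly 68% / 95% / 99.7%

outliers = transactions[(transactions > mean + 3 * std) |
                        (transactions < mean - 3 * std)]
```

On normally distributed data, very few points land outside the ±3 band, matching the empirical rule above.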

Here we can see that the ±3 standard deviation boundaries are being dragged outward on the left and right. When we take those data points out using our outlier-removal technique, we get this.

Z-score
This looks at how many standard deviations a value is away from the mean. It has an upper and lower boundary of +3 and −3.
The z-score is easily used to detect outliers and has this formula: z = (x − mean) / standard deviation, applied to each data point.


When dealing with outliers with the z-score, we just need to find all data points whose z-scores are greater than +3 or less than −3, i.e., more than 3 standard deviations above or below the mean. These values are considered outliers.
So you can easily say: show me all the rows with |z| > 3. Those rows are the outliers; the z-score tells you exactly how many standard deviations each one is from the mean.
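A sketch of z-score filtering on simulated data with one injected extreme value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 120.0))

z = (data - data.mean()) / data.std()   # z-score of every point
outliers = data[z.abs() > 3]            # rows more than 3 SDs from the mean

print(outliers.tolist())  # the injected 120.0 is flagged
```

Note that the outlier itself inflates the mean and std used to compute z; with many extreme values this weakens the method, which is exactly the caveat in the comparison table below.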
When do I use the z-score and when do I use IQR?
When Detecting Outliers
Method | Use When... | Avoid When... | Notes |
Z-score | Data is roughly normally distributed (bell-shaped) and you don't have many extreme outliers | Data is skewed or has many extreme values, because the z-score depends on the mean, and the mean is dragged in the presence of outliers | Easy to use, but sensitive to outliers because it uses mean and std |
IQR | Data may be skewed or not normally distributed | Very small datasets (may over-flag values) | Robust to outliers, uses median and quartiles (Q1 and Q3). |
Rule of thumb:
Use domain rules first, then IQR if you're unsure. The z-score is fine if your data is normal and clean, which might not always be the case.
When Preprocessing for Machine Learning
You're usually scaling your features before feeding them into models. Here's which scaler to use depending on whether your data has outliers:
Scaler | Use When... | Avoid When... | Notes |
StandardScaler | Data is normal, no outliers | Outliers are present | Uses mean and std — sensitive to outliers - mean = 0, std = 1 |
MinMaxScaler | You want data between 0–1 (like for neural networks), and no outliers | There are large or extreme values | Min/max values stretch the scale badly |
RobustScaler | There are outliers, and you want to minimize their impact | Data has no outliers and you want full range scaling | Uses median and IQR — best choice with outliers |
No scaling | Tree-based models (Random Forest, XGBoost) | Models sensitive to feature scale (e.g. KNN, SVM, Logistic Regression) | Trees split on thresholds, so they are scale-invariant; no need to scale features |
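A sketch comparing the three scalers on a column with one extreme value (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[54.0], [55], [60], [61], [62], [66], [67], [71], [250]])

minmax = MinMaxScaler().fit_transform(X)
standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# The outlier (250) squashes the inliers into a narrow band under MinMaxScaler
print(minmax[:-1].max())   # all non-outlier values end up below ~0.1
# RobustScaler centers on the median and scales by the IQR instead
print(robust[4])           # the median value (62) maps to exactly 0
```

With MinMaxScaler the single extreme value consumes almost the whole 0–1 range, while RobustScaler leaves the bulk of the data on a sensible scale.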
Summary Table
Phase | Best Method if Outliers Present | Why |
Outlier Detection | IQR or domain filtering | More robust than z-score |
Preprocessing for ML | RobustScaler | Not affected by outliers |
ML Model Choice | Tree models (no scaling needed), or scale with RobustScaler for others | Trees split on thresholds, not distances, so they are scale-insensitive. |

You can actually use the describe() function in pandas to understand your dataset a little.
Looking at the age and annual income columns, I see minimum values of 1 and 0, which aren't really reasonable. The same applies to the max values, and this tells me there are outliers without even digging deep.
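For illustration, a hypothetical customer table (the column names and values here are assumptions, not the actual dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 1, 34, 45, 29, 90],  # a 1-year-old customer is suspicious
    "Annual Income": [42_000, 0, 55_000, 61_000, 48_000, 1_000_000],
})

# describe() surfaces count, mean, std, min, quartiles, and max per column
print(df.describe())
# Implausible min values (Age = 1, Income = 0) and extreme max values
# hint at outliers before any deeper analysis
```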
From Law of Large Numbers to Chi-Square Testing
Law of Large Numbers (LLN)
As the sample size grows, the sample mean gets closer to the population mean: the larger the sample you draw from the population, the closer its mean tends to be to the true mean.
Why it matters:
- Justifies why bigger samples give more reliable estimates.
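A small simulation illustrating the LLN on an assumed synthetic population:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=10, size=100_000)
pop_mean = population.mean()

def avg_abs_error(n, trials=200):
    """Average distance between the sample mean and the population mean."""
    return np.mean([abs(rng.choice(population, size=n).mean() - pop_mean)
                    for _ in range(trials)])

# Larger samples land closer to the population mean on average
print(avg_abs_error(10), avg_abs_error(1000))
```

The error shrinks roughly in proportion to 1/√n, which is exactly what the standard error below quantifies.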
Standard Error (SE)
The standard deviation of the sampling distribution of a statistic (usually the mean).
Formula:

SE = σ / √n

where:
- σ = population standard deviation (if σ is unknown and the sample size is > 30, the sample standard deviation s can be used instead)
- n = sample size
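For example, the standard error of the mean for the eight incomes from the orange table above:

```python
import math
import statistics

sample = [71, 62, 66, 61, 54, 67, 55, 60]

s = statistics.stdev(sample)       # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(len(sample))    # SE = s / sqrt(n)

print(round(se, 2))  # 2.07
```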
Z-Score & Z-Table
- Used to find p-values and critical values in hypothesis testing.
Confidence Level & Confidence Interval
- Confidence Level (CL): Probability that the interval contains the true parameter (e.g., 95% CL).
- To find the critical z-value, look up (1 + CL) / 2 = CL/2 + 1/2 in the z-table (e.g., 0.975 for a 95% CL).
Why used:
- Gives a range of plausible values for a population parameter.
- More informative than a single point estimate.
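A sketch of a 95% confidence interval for the mean of the orange-table incomes, using the z critical value (for a sample this small a t critical value would strictly be more appropriate):

```python
import math
import statistics
from scipy import stats

sample = [71, 62, 66, 61, 54, 67, 55, 60]
mean = statistics.mean(sample)                        # 62.0
se = statistics.stdev(sample) / math.sqrt(len(sample))

z_crit = stats.norm.ppf(0.975)                        # ~1.96 for a 95% CL
ci = (mean - z_crit * se, mean + z_crit * se)

print(tuple(round(b, 1) for b in ci))  # roughly (57.9, 66.1)
```

The interval is a range of plausible values for the population mean, which is more informative than reporting 62.0 alone.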
Hypothesis Testing & p-Value
Steps:
- Null hypothesis (H₀): Statement to test (e.g., μ = 100).
- Alternative hypothesis (H₁): What you’re trying to show.
- Choose significance level α (commonly 0.05).
- Calculate test statistic (Z or t).
- Find p-value = probability of observing results as extreme as yours if H₀ is true.
- Decision rule:
- If p < α → reject H₀.
- If p ≥ α → fail to reject H₀.
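A sketch of these steps with a one-sample t-test (the sample values are made up; H₀: μ = 100):

```python
from scipy import stats

sample = [112, 98, 105, 110, 103, 99, 107, 111, 104, 108]

# ttest_1samp computes the t statistic and two-sided p-value against mu = 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```

Here the sample mean (105.7) is far enough above 100, relative to the variability, that H₀ is rejected at α = 0.05.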
Chi-Square Test (χ²)
- Testing relationships between categorical variables (independence test).
- Checking if observed distribution matches expected (goodness-of-fit test).
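A sketch of the independence test on a hypothetical 2×2 contingency table (the counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: prefers product X / prefers product Y
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A small p (< 0.05) suggests the two variables are not independent
```

The `expected` table holds the counts we would see under independence; the χ² statistic measures how far `observed` deviates from it.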
🔗 How They Connect
Law of Large Numbers → Larger samples give better estimates
↓
Standard Error → Quantifies variability of the sample mean
↓
Z-Score & Z-Table → Standardize results and get probabilities
↓
Confidence Interval → Range estimate for population parameters
↓
Hypothesis Testing → Use p-value to make decisions
↓
Chi-Square Test → Special case for categorical data relationships