Descriptive vs. Inferential Statistics

Descriptive Statistics
- What it studies: Things like average, minimum, maximum, variance, and standard deviation
- Purpose: To summarize and describe the data you already have
Inferential Statistics
- What it studies: Hypothesis testing, confidence intervals, regression
- Purpose: To make guesses or predictions about a bigger group (population) using a smaller group (sample)
Measures of Central Tendency: Mean, Median, Mode


On the left side of the image, we see the monthly income of six people. The average (mean) income is $6,250. Based on this, someone might decide it’s a good area to open a luxury store.
But on the right side, one more person is added: Elon Musk, who earns $10 million a month. The new average income jumps to about $1.43 million.
That number doesn’t reflect what most people in the area actually earn. Musk’s income is an outlier, and outliers can pull the average way up or down. So using just the average (mean) isn’t always a good way to understand a group, especially when one value is much higher or lower than the rest.
This is also why, when imputing missing values (for example with the mean), we take care of outliers first; otherwise the imputed values inherit the distortion.
Range, IQR
The range is the difference between the maximum and minimum. The IQR (interquartile range) is the difference between Q3 (75th percentile) and Q1 (25th percentile). Because it only considers the middle 50% of the values, it is robust against outliers. It shows the data spread, and anything above the upper bound (Q3 + 1.5 × IQR) or below the lower bound (Q1 − 1.5 × IQR) is considered an outlier.
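As a sketch with made-up income values, the 1.5 × IQR rule can be applied with pandas:

```python
import pandas as pd

# Hypothetical values; 250 is an obvious outlier
data = pd.Series([54, 55, 60, 61, 62, 66, 67, 71, 250])

q1 = data.quantile(0.25)          # 25th percentile (Q1)
q3 = data.quantile(0.75)          # 75th percentile (Q3)
iqr = q3 - q1                     # interquartile range

lower_bound = q1 - 1.5 * iqr
upper_bound = q3 + 1.5 * iqr

outliers = data[(data < lower_bound) | (data > upper_bound)]
print(outliers.tolist())  # only 250 falls outside the bounds
```

Note that the bounds are computed from the middle of the distribution, so the extreme value 250 cannot drag them the way it would drag the mean.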



Measures of Dispersion: Variance and Standard Deviation
Both variance and standard deviation (SD) tell us how spread out the numbers are in a data set like daily stock prices.
- Variance shows how much the numbers differ from the average (mean).
- Standard deviation is just the square root of the variance. It tells us the same thing, but in the same unit as the data (like dollars).
Example: Tesla vs Coca-Cola
Tesla (High Variance and SD)
Tesla’s stock might go from:
- $680 → $700 → $630 → $710 → $650
These prices jump up and down a lot, even if the average is around $674. This means:
- High variance
- High standard deviation
Investors call Tesla a volatile or risky stock because it changes value a lot.
Coca-Cola (Low Variance and SD)
Coca-Cola’s stock might go:
- $60 → $61 → $59 → $60 → $61
These prices are very close to each other and to the average (about $60). This means:
- Low variance
- Low standard deviation
Coca-Cola is seen as a stable stock with low risk.



The left table (orange) and right table (green) show two different distributions of yearly income, along with dot plots visualizing them.
Left Table (Orange) – Low Variance & Low Standard Deviation
- The yearly incomes are:
71, 62, 66, 61, 54, 67, 55, 60
- These values cluster closely around the mean (≈ 62).
- Low spread = Low variance and low standard deviation
- This means the data is consistent and stable, with small deviations from the mean.
Right Table (Green) – High Variance & High Standard Deviation
- The yearly incomes are:
99, 14, 75, 84, 44, 54, 98, 28
- The values are widely spread out — ranging from 14 to 99.
- High spread = High variance and high standard deviation
- This means the data is less consistent, with large deviations from the average.
📊 Summary:
Group | Mean | Spread of Data | Variance & SD |
Left (Orange) | ≈ 62 | Tight, consistent | Low |
Right (Green) | ≈ 62 | Wide, inconsistent | High |
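The two tables above can be reproduced with NumPy; both groups share the same mean but differ sharply in spread:

```python
import numpy as np

left = np.array([71, 62, 66, 61, 54, 67, 55, 60])    # orange table
right = np.array([99, 14, 75, 84, 44, 54, 98, 28])   # green table

# Same mean, very different dispersion
print(left.mean(), right.mean())        # both 62.0
print(left.var(), left.std())           # low variance / low SD
print(right.var(), right.std())         # high variance / high SD
```

This is why the mean alone is not enough to describe a dataset: you need a measure of dispersion alongside it.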
Skewness

Data can be normally distributed, or it can be skewed to the left or right.
- When data is left-skewed (negative skew), most values are high, but a few low numbers pull the "tail" to the left. This can affect how your model learns from the data.
- When data is right-skewed (positive skew), most values are low, but a few very high numbers stretch the "tail" to the right. This also shifts the data away from the average.
How Skewness Affects Modeling:
Impact | Explanation |
1. Biased predictions | Models may learn patterns that are more influenced by the skewed tail. |
2. Poor accuracy | Linear models (like regression) assume normality—skew breaks that assumption. |
3. Wrong feature importance | Outliers in skewed data can make a feature seem more or less important. |
To fix skewed data, we often:
- Look for outliers (extremely high or low values) and treat them so the data becomes more balanced or normally distributed, which helps improve model accuracy.
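Besides treating outliers, a log transform is a common way to reduce right skew. A sketch on simulated (assumed, not real) income data:

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(3)
incomes = rng.lognormal(mean=10, sigma=1, size=5000)  # strongly right-skewed

before = skew(incomes)
after = skew(np.log1p(incomes))  # log(1 + x) pulls the long right tail in

print(f"skew before: {before:.2f}, after: {after:.2f}")
```

After the transform the distribution is close to symmetric, which is friendlier to models that assume normality.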

Normal distribution: data is balanced around the mean.
Quick Recap: Empirical Rule (Normal Distribution)
- 68% of the data lies within ±1 standard deviation of the mean.
- 95% lies within ±2 standard deviations.
- 99.7% lies within ±3 standard deviations.
Anything lying outside ±3 standard deviations is considered an outlier and should be dealt with.

Take the data as a pd.Series and find its mean and standard deviation. Then calculate the mean ± 1, 2, and 3 standard deviations, and filter the data to see which points fall within each band.
Anything above mean + 3 × std or below mean − 3 × std is considered an outlier.
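A sketch of that procedure on a simulated transaction column (the data here is synthetic):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
transactions = pd.Series(rng.normal(loc=100, scale=10, size=1000))

mean, std = transactions.mean(), transactions.std()

# Bands at 1, 2, and 3 standard deviations around the mean
for k in (1, 2, 3):
    within = transactions.between(mean - k * std, mean + k * std)
    print(f"within ±{k} SD: {within.mean():.1%}")  # roughly 68% / 95% / 99.7%

outliers = transactions[(transactions > mean + 3 * std) |
                        (transactions < mean - 3 * std)]
```

On normally distributed data, very few points land outside the ±3 band, matching the empirical rule above.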

Here we can see that the ±3 standard deviation boundaries are being dragged outward on the left and right. When we take those data points out using our outlier-removal technique, we get this.

Z-score
This looks at how many standard deviations a value is away from the mean. It has an upper and lower boundary of +3 and −3.
The z-score is easily used to detect outliers and has this formula: z = (x − mean) / standard deviation, applied to each data point.


When dealing with outliers with the z-score, we just need to find all data points whose z-scores are greater than +3 or less than −3, i.e., more than 3 standard deviations above or below the mean. These values are considered outliers.
So you can easily say: show me all the rows with |z| > 3. Those rows are the outliers; the z-score tells you exactly how many standard deviations each one is from the mean.
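A sketch of z-score filtering on simulated data with one injected extreme value:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
data = pd.Series(np.append(rng.normal(loc=50, scale=5, size=200), 120.0))

z = (data - data.mean()) / data.std()   # z-score of every point
outliers = data[z.abs() > 3]            # rows more than 3 SDs from the mean

print(outliers.tolist())  # the injected 120.0 is flagged
```

Note that the outlier itself inflates the mean and std used to compute z; with many extreme values this weakens the method, which is exactly the caveat in the comparison table below.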
When do I use the z-score and when do I use IQR?
When Detecting Outliers
Method | Use When... | Avoid When... | Notes |
Z-score | Data is roughly normally distributed (bell-shaped) and you don't have many extreme outliers | Data is skewed or has many extreme values, because the z-score depends on the mean, and the mean is dragged in the presence of outliers | Easy to use, but sensitive to outliers because it uses mean and std |
IQR | Data may be skewed or not normally distributed | Very small datasets (may over-flag values) | Robust to outliers, uses median and quartiles (Q1 and Q3). |
Rule of thumb:
Use domain rules first, then IQR if you're unsure. The z-score is fine if your data is normal and clean, which might not always be the case.
When Preprocessing for Machine Learning
You're usually scaling your features before feeding them into models. Here's which scaler to use depending on whether your data has outliers:
Scaler | Use When... | Avoid When... | Notes |
StandardScaler | Data is normal, no outliers | Outliers are present | Uses mean and std — sensitive to outliers - mean = 0, std = 1 |
MinMaxScaler | You want data between 0–1 (like for neural networks), and no outliers | There are large or extreme values | Min/max values stretch the scale badly |
RobustScaler | There are outliers, and you want to minimize their impact | Data has no outliers and you want full range scaling | Uses median and IQR — best choice with outliers |
No scaling | Tree-based models (Random Forest, XGBoost) | Models sensitive to feature scale (e.g. KNN, SVM, Logistic Regression) | Trees split on thresholds, so they are scale-invariant; no need to scale features |
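A sketch comparing the three scalers on a column with one extreme value (the numbers are made up):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

X = np.array([[54.0], [55], [60], [61], [62], [66], [67], [71], [250]])

minmax = MinMaxScaler().fit_transform(X)
standard = StandardScaler().fit_transform(X)
robust = RobustScaler().fit_transform(X)

# The outlier (250) squashes the inliers into a narrow band under MinMaxScaler
print(minmax[:-1].max())   # all non-outlier values end up below ~0.1
# RobustScaler centers on the median and scales by the IQR instead
print(robust[4])           # the median value (62) maps to exactly 0
```

With MinMaxScaler the single extreme value consumes almost the whole 0–1 range, while RobustScaler leaves the bulk of the data on a sensible scale.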
Summary Table
Phase | Best Method if Outliers Present | Why |
Outlier Detection | IQR or domain filtering | More robust than z-score |
Preprocessing for ML | RobustScaler | Not affected by outliers |
ML Model Choice | Tree models (no scaling needed), or scale with RobustScaler for others | Trees split on thresholds, not distances, so they are scale-insensitive. |

You can actually use the describe() function in pandas to understand your dataset a little.
Looking at the age and annual income columns, I see minimum values of 1 and 0, which aren't really reasonable. The same applies to the max values, and this tells me there are outliers without even digging deep.
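For illustration, a hypothetical customer table (the column names and values here are assumptions, not the actual dataset):

```python
import pandas as pd

df = pd.DataFrame({
    "Age": [25, 1, 34, 45, 29, 90],  # a 1-year-old customer is suspicious
    "Annual Income": [42_000, 0, 55_000, 61_000, 48_000, 1_000_000],
})

# describe() surfaces count, mean, std, min, quartiles, and max per column
print(df.describe())
# Implausible min values (Age = 1, Income = 0) and extreme max values
# hint at outliers before any deeper analysis
```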
From Law of Large Numbers to Chi-Square Testing
Law of Large Numbers (LLN)
As the sample size grows, the sample mean gets closer to the population mean: the larger the sample you draw from the population, the closer its mean tends to be to the true mean.
Why it matters:
- Justifies why bigger samples give more reliable estimates.
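A small simulation illustrating the LLN on an assumed synthetic population:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=10, size=100_000)
pop_mean = population.mean()

def avg_abs_error(n, trials=200):
    """Average distance between the sample mean and the population mean."""
    return np.mean([abs(rng.choice(population, size=n).mean() - pop_mean)
                    for _ in range(trials)])

# Larger samples land closer to the population mean on average
print(avg_abs_error(10), avg_abs_error(1000))
```

The error shrinks roughly in proportion to 1/√n, which is exactly what the standard error below quantifies.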
Standard Error (SE)
The standard deviation of the sampling distribution of a statistic (usually the mean).
Formula:

SE = σ / √n

where:
- σ = population standard deviation (if σ is unknown and the sample size is > 30, the sample standard deviation s can be used instead)
- n = sample size
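For example, the standard error of the mean for the eight incomes from the orange table above:

```python
import math
import statistics

sample = [71, 62, 66, 61, 54, 67, 55, 60]

s = statistics.stdev(sample)       # sample standard deviation (n - 1 divisor)
se = s / math.sqrt(len(sample))    # SE = s / sqrt(n)

print(round(se, 2))  # 2.07
```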
Z-Score & Z-Table
- Used to find p-values and critical values in hypothesis testing.
Confidence Level & Confidence Interval
- Confidence Level (CL): Probability that the interval contains the true parameter (e.g., 95% CL).
- To find the critical z-value, look up (1 + CL) / 2 = CL/2 + 1/2 in the z-table (e.g., 0.975 for a 95% CL).
Why used:
- Gives a range of plausible values for a population parameter.
- More informative than a single point estimate.
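A sketch of a 95% confidence interval for the mean of the orange-table incomes, using the z critical value (for a sample this small a t critical value would strictly be more appropriate):

```python
import math
import statistics
from scipy import stats

sample = [71, 62, 66, 61, 54, 67, 55, 60]
mean = statistics.mean(sample)                        # 62.0
se = statistics.stdev(sample) / math.sqrt(len(sample))

z_crit = stats.norm.ppf(0.975)                        # ~1.96 for a 95% CL
ci = (mean - z_crit * se, mean + z_crit * se)

print(tuple(round(b, 1) for b in ci))  # roughly (57.9, 66.1)
```

The interval is a range of plausible values for the population mean, which is more informative than reporting 62.0 alone.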
Hypothesis Testing & p-Value
Steps:
- Null hypothesis (H₀): Statement to test (e.g., μ = 100).
- Alternative hypothesis (H₁): What you’re trying to show.
- Choose significance level α (commonly 0.05).
- Calculate test statistic (Z or t).
- Find p-value = probability of observing results as extreme as yours if H₀ is true.
- Decision rule:
- If p < α → reject H₀.
- If p ≥ α → fail to reject H₀.
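A sketch of these steps with a one-sample t-test (the sample values are made up; H₀: μ = 100):

```python
from scipy import stats

sample = [112, 98, 105, 110, 103, 99, 107, 111, 104, 108]

# ttest_1samp computes the t statistic and two-sided p-value against mu = 100
t_stat, p_value = stats.ttest_1samp(sample, popmean=100)

alpha = 0.05
if p_value < alpha:
    print(f"p = {p_value:.4f} < {alpha}: reject H0")
else:
    print(f"p = {p_value:.4f} >= {alpha}: fail to reject H0")
```

Here the sample mean (105.7) is far enough above 100, relative to the variability, that H₀ is rejected at α = 0.05.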
Chi-Square Test (χ²)
- Testing relationships between categorical variables (independence test).
- Checking if observed distribution matches expected (goodness-of-fit test).
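A sketch of the independence test on a hypothetical 2×2 contingency table (the counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: group A / group B; columns: prefers product X / prefers product Y
observed = np.array([[30, 10],
                     [20, 40]])

chi2, p, dof, expected = chi2_contingency(observed)

print(f"chi2 = {chi2:.2f}, p = {p:.4f}, dof = {dof}")
# A small p (< 0.05) suggests the two variables are not independent
```

The `expected` table holds the counts we would see under independence; the χ² statistic measures how far `observed` deviates from it.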
🔗 How They Connect
Law of Large Numbers → Larger samples give better estimates
↓
Standard Error → Quantifies variability of the sample mean
↓
Z-Score & Z-Table → Standardize results and get probabilities
↓
Confidence Interval → Range estimate for population parameters
↓
Hypothesis Testing → Use p-value to make decisions
↓
Chi-Square Test → Special case for categorical data relationships