Descriptive vs. Inferential Statistics

Descriptive Statistics
- What it studies: Things like average, minimum, maximum, variance, and standard deviation
- Purpose: To summarize and describe the data you already have
Inferential Statistics
- What it studies: Hypothesis testing, confidence intervals, regression
- Purpose: To make guesses or predictions about a bigger group (population) using a smaller group (sample)
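As a quick sketch of the descriptive side, Python's built-in `statistics` module can summarize a sample directly (the six monthly incomes below are made up for illustration):

```python
import statistics

# Made-up sample: six monthly incomes (descriptive statistics only
# summarize the data we already have; no inference about a population)
incomes = [5000, 5500, 6000, 6500, 7000, 7500]

print(statistics.mean(incomes))             # average: 6250
print(min(incomes), max(incomes))           # minimum and maximum: 5000 7500
print(statistics.variance(incomes))         # sample variance: 875000
print(round(statistics.stdev(incomes), 2))  # sample standard deviation: ~935.41
```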
Measures of Central Tendency: Mean, Median, Mode


On the left side of the image, we see the monthly income of six people. The average (mean) income is $6,250. Based on this, someone might decide it’s a good area to open a luxury store.
But on the right side, one more person is added: Elon Musk, who earns $10 million a month. This makes the new average income jump to about $1.43 million.
That number doesn’t reflect the real income of most people in the area. Musk’s high income is an outlier, and outliers can pull the average way up or down. So using just the average (mean) isn’t always a good way to understand a group, especially when there's one value much higher or lower than the rest.
So before imputing missing values with statistics like the mean, we first deal with these outliers so the imputed values aren't distorted by them.
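A small sketch of this effect (the six incomes are made up so that their mean is $6,250, matching the example above):

```python
import statistics

# Six made-up residents whose mean income is $6,250
incomes = [5000, 5500, 6000, 6500, 7000, 7500]
print(statistics.mean(incomes), statistics.median(incomes))  # 6250 and 6250.0

# Add one extreme earner at $10M per month
incomes.append(10_000_000)
print(round(statistics.mean(incomes)))  # 1433929 -- the mean explodes
print(statistics.median(incomes))       # 6500 -- the median barely moves
```

This is why the median is often preferred over the mean when outliers are present.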
Range, IQR
IQR looks at the difference (range) between Q3 and Q1. Because it is based only on the middle 50% of the values, it is robust against outliers. It shows the data's spread, and anything above the upper bound or below the lower bound (commonly Q3 + 1.5 × IQR and Q1 - 1.5 × IQR) is considered an outlier.
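A minimal sketch of IQR-based outlier detection with Python's `statistics.quantiles` (the data is made up; note that different libraries use slightly different quartile conventions, so the exact cutoffs can vary):

```python
import statistics

data = [54, 55, 60, 61, 62, 66, 67, 71, 200]  # 200 is an obvious outlier

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3
q1, q2, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1
lower = q1 - 1.5 * iqr   # lower bound
upper = q3 + 1.5 * iqr   # upper bound

outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # only 200 falls outside the bounds
```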



Measures of Dispersion: Variance and Standard Deviation
Both variance and standard deviation (SD) tell us how spread out the numbers are in a data set like daily stock prices.
- Variance shows how much the numbers differ from the average (mean).
- Standard deviation is just the square root of the variance. It tells us the same thing, but in the same unit as the data (like dollars).
Example: Tesla vs Coca-Cola
Tesla (High Variance and SD)
Tesla’s stock might go from:
- $680 → $700 → $630 → $710 → $650
These prices jump up and down a lot, even if the average is around $674. This means:
- High variance
- High standard deviation
Investors call Tesla a volatile or risky stock because it changes value a lot.
Coca-Cola (Low Variance and SD)
Coca-Cola’s stock might go:
- $60 → $61 → $59 → $60 → $61
These prices are very close to each other and to the average (about $60). This means:
- Low variance
- Low standard deviation
Coca-Cola is seen as a stable stock with low risk.
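Using the made-up prices from the two examples above, the spread can be computed directly (population standard deviation via `statistics.pstdev`):

```python
import statistics

tesla = [680, 700, 630, 710, 650]
coke = [60, 61, 59, 60, 61]

print(statistics.mean(tesla), statistics.mean(coke))  # 674 and 60.2
print(round(statistics.pstdev(tesla), 2))  # ~30.07: big swings around the mean
print(round(statistics.pstdev(coke), 2))   # ~0.75: prices hug the mean
```

The standard deviation is in dollars, so it reads naturally: Tesla typically moves about $30 around its mean, Coca-Cola less than $1.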



The left table (orange) and right table (green) show two different distributions of yearly income, along with dot plots visualizing them.
Left Table (Orange) – Low Variance & Low Standard Deviation
- The yearly incomes are:
71, 62, 66, 61, 54, 67, 55, 60
- These values cluster closely around the mean (≈ 62).
- Low spread = Low variance and low standard deviation
- This means the data is consistent and stable, with small deviations from the mean.
Right Table (Green) – High Variance & High Standard Deviation
- The yearly incomes are:
99, 14, 75, 84, 44, 54, 98, 28
- The values are widely spread out — ranging from 14 to 99.
- High spread = High variance and high standard deviation
- This means the data is less consistent, with large deviations from the average.
📊 Summary:
| Group | Mean | Spread of Data | Variance & SD |
|---|---|---|---|
| Left (Orange) | ≈ 62 | Tight, consistent | Low |
| Right (Green) | ≈ 62 | Wide, inconsistent | High |
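The two tables can be checked numerically; both groups have the same mean but very different spread:

```python
import statistics

left = [71, 62, 66, 61, 54, 67, 55, 60]    # orange table
right = [99, 14, 75, 84, 44, 54, 98, 28]   # green table

print(statistics.mean(left), statistics.mean(right))  # both means are exactly 62
print(statistics.pvariance(left))    # population variance: 30
print(statistics.pvariance(right))   # population variance: 895.75
print(round(statistics.pstdev(left), 2), round(statistics.pstdev(right), 2))  # ~5.48 vs ~29.93
```

Same center, wildly different spread: the mean alone hides this, which is why variance and SD matter.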
Skewness

Data can be normally distributed, or it can be skewed to the left or right.
- When data is left-skewed (negative skew), most values are high, but a few low numbers pull the "tail" to the left. This can affect how your model learns from the data.
- When data is right-skewed (positive skew), most values are low, but a few very high numbers stretch the "tail" to the right. This also shifts the data away from the average.
How Skewness Affects Modeling:
| Impact | Explanation |
|---|---|
| Biased predictions | Models may learn patterns that are more influenced by the skewed tail. |
| Poor accuracy | Linear models (like regression) assume normality; skew breaks that assumption. |
| Wrong feature importance | Outliers in skewed data can make a feature seem more or less important. |
To fix skewed data, we often:
- Look for outliers (extremely high or low values) and treat them so the data becomes more balanced or normally distributed, which helps improve model accuracy.
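Besides treating outliers, one common fix for right-skewed data is a log transform. A sketch with pandas and NumPy (the data is made up; `log1p` computes log(1 + x), which also handles zeros safely):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed data: mostly small values with a long upper tail
s = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 8, 50])

skew_before = s.skew()

# The log transform compresses the long right tail
s_log = np.log1p(s)
skew_after = s_log.skew()

print(round(skew_before, 2), round(skew_after, 2))  # skew is reduced, not eliminated
```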

Normal distribution: data is balanced around the mean.
Quick Recap: Empirical Rule (Normal Distribution)
- 68% of the data lies within ±1 standard deviation of the mean.
- 95% lies within mean ±2 standard deviations.
- 99.7% lies within mean ±3 standard deviations.
Anything lying outside ±3 standard deviations is considered an outlier and should be dealt with.


Take the data as a pd.Series and find its mean and standard deviation. Then calculate mean ± 1, 2, and 3 standard deviations, and filter the data to see which points fall within each band.
Anything above mean + 3*std or below mean - 3*std is considered an outlier.
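The recipe above can be sketched with pandas (the data is made up, with one extreme value planted as the outlier):

```python
import pandas as pd

# Made-up data: 30 tightly clustered values plus one extreme point
data = pd.Series([8, 9, 10, 11, 12] * 6 + [1000])

mean, std = data.mean(), data.std()  # pandas std() uses the sample formula (ddof=1)

# Points falling within 1, 2, and 3 standard deviations of the mean
within_1 = data[data.between(mean - 1 * std, mean + 1 * std)]
within_2 = data[data.between(mean - 2 * std, mean + 2 * std)]
within_3 = data[data.between(mean - 3 * std, mean + 3 * std)]

# Anything beyond mean +/- 3*std is treated as an outlier
outliers = data[(data < mean - 3 * std) | (data > mean + 3 * std)]

print(len(within_1), len(within_2), len(within_3))  # 30 30 30
print(list(outliers))                               # [1000]
```

Keep in mind the 68/95/99.7 percentages only hold for roughly normal data; a single extreme value also inflates the mean and std themselves, which is why the IQR method above is often preferred.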