At-a-Glance 👀
How I tackle data science projects step-by-step using CRISP-DM.
My 12-Step EDA Process (Regression & Classification)
1. 📦 Understanding the Dataset
What I do:
I start by loading the dataset and exploring its overall structure. I look at the number of rows and columns, check the column names, and inspect the data types.
Why I do it:
This helps me understand the scale of the data, identify which columns are numerical, categorical, datetime, or possibly IDs, and spot early signs of formatting issues.
Steps I take:
df.shape
df.info()
df.head()
df.tail()
df.describe(include='all')
2. 🎯 Exploring the Target Variable
For classification:
What I do:
I check the class distribution to detect imbalance.
Why I do it:
Class imbalance can distort accuracy and may require techniques like stratified sampling, SMOTE, or undersampling.
df['target'].value_counts(normalize=True).plot(kind='bar')
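When imbalance shows up, stratified sampling is the cheapest first fix. A minimal sketch (the toy `df` and its 90/10 `target` column are made up for illustration) showing how `stratify=` preserves the class ratio in both splits:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy imbalanced dataset: 90% class 0, 10% class 1
df = pd.DataFrame({"feature": range(100), "target": [0] * 90 + [1] * 10})

# stratify=df['target'] keeps the 90/10 class ratio identical in train and test
train, test = train_test_split(
    df, test_size=0.2, stratify=df["target"], random_state=42
)

print(train["target"].value_counts(normalize=True))
print(test["target"].value_counts(normalize=True))
```

SMOTE and undersampling live in the separate `imbalanced-learn` package; stratification alone only preserves the ratio, it does not rebalance it.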
For regression:
What I do:
I visualize the target's distribution and check for skewness or extreme values.
Why I do it:
Understanding if the target is normally distributed helps me decide whether to apply transformations like log or Box-Cox.
sns.histplot(df['target'], kde=True)
sns.boxplot(x=df['target'])
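To see why a log transform helps, here is a quick sketch on a synthetic right-skewed target (the lognormal sample is invented for illustration); `np.log1p` pulls the long tail in and drives the skew toward zero:

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed target, e.g. prices or incomes
target = pd.Series(np.random.default_rng(0).lognormal(mean=3, sigma=1, size=1000))

print("skew before:", target.skew())          # strongly positive
print("skew after :", np.log1p(target).skew())  # close to zero
```

`log1p` (log of 1 + x) is preferred over a plain `log` when the target can contain zeros; remember to invert with `np.expm1` when reporting predictions.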
3. 🔍 Auditing Data Quality
What I do:
I identify missing values, duplicated rows, or incorrect data types.
Why I do it:
Missing or malformed data can break models or lead to misleading results.
Steps I take:
df.isnull().sum().sort_values(ascending=False)
df.duplicated().sum()
df[df.duplicated(keep=False)]
df.dtypes
If I see object columns that should be dates or numerics, I convert them using:
df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')
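The numeric counterpart works the same way: `pd.to_numeric` with `errors='coerce'` turns unparseable entries into `NaN` instead of raising, so they surface in the missing-value audit. A minimal sketch (the column values are made up):

```python
import pandas as pd

# Object column that should be numeric; 'n/a' cannot be parsed
s = pd.Series(["10", "3.5", "n/a", "7"])

# errors='coerce' converts bad entries to NaN rather than raising
nums = pd.to_numeric(s, errors="coerce")
print(nums.isna().sum())  # 1 unparseable value
```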
4. 🔢 Univariate Feature Exploration
What I do:
I analyze each feature individually.
Why I do it:
This helps me understand distributions, detect weird patterns, and spot constant or low-variance columns.
For numeric features:
df.select_dtypes(include=['int', 'float']).hist(figsize=(15,10))
For categorical features:
for col in cat_cols:
    display(df[col].value_counts())
    sns.countplot(y=col, data=df)
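For the constant and low-variance columns mentioned above, `nunique()` is usually enough. A minimal sketch (the toy `df` and its column names are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "constant": [1] * 5,          # a single value: zero variance, no signal
    "binary": [0, 1, 0, 1, 0],
    "id_like": list(range(5)),    # unique per row: likely an identifier
})

# Columns with exactly one unique value carry no predictive signal
constant_cols = [c for c in df.columns if df[c].nunique() == 1]
print(constant_cols)  # ['constant']
```

Columns where `nunique()` equals the row count are the opposite red flag: probable IDs that should be excluded from modeling.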
5. 🔗 Bivariate Analysis: Feature–Target Relationships
What I do:
I study how each feature relates to the target variable.
Why I do it:
It tells me which features might be useful predictors, and helps detect data leakage.
For classification:
- Numerical → Target:
sns.boxplot(x='target', y='feature', data=df)
- Categorical → Target:
pd.crosstab(df['feature'], df['target'], normalize='index').plot(kind='bar', stacked=True)
For regression:
- Numerical → Target:
sns.scatterplot(x='feature', y='target', data=df)
- Categorical → Target:
df.groupby('feature')['target'].mean().plot(kind='bar')
6. 🔁 Multivariate Relationships Between Features
What I do:
I analyze how features relate to each other.
Why I do it:
This helps me identify redundant features and potential multicollinearity, especially important in regression.
For numeric–numeric:
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
For categorical–categorical:
I calculate Cramér’s V:
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))
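A quick sanity check of the statistic (self-contained, with the function repeated and a made-up perfectly associated pair): when one categorical variable is a deterministic relabeling of another, Cramér's V comes out as 1.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))

# 'b' is a deterministic relabeling of 'a' -> perfect association
a = pd.Series(list("xxyyzz") * 20)
b = a.map({"x": "p", "y": "q", "z": "r"})
print(round(cramers_v(a, b), 2))  # → 1.0
```

Note that for 2×2 tables `chi2_contingency` applies Yates' continuity correction by default, which pulls the value slightly below 1 even for perfect association; pass `correction=False` there if that matters.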
7. ⚠️ Checking for Data Leakage
What I do:
I test whether any feature is too strongly tied to the target — possibly because it was created after the fact.
Why I do it:
Data leakage inflates accuracy unrealistically and causes models to fail in production.
How I check:
- I look for features where every unique value maps to one target class:
df.groupby('feature')['target'].nunique()
- I check Cramér’s V or correlation with the target — if it’s ~1, that’s a red flag.
- I ask: “Would I have access to this information at prediction time?”
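The first check above can be turned into a one-pass scan over all features. A minimal sketch (the toy `df` and the column names `leaky`/`honest` are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "leaky": ["a", "a", "b", "b", "c", "c"],   # each value maps to one class
    "honest": ["u", "v", "u", "v", "u", "v"],
    "target": [0, 0, 1, 1, 0, 0],
})

# Flag features where every unique value maps to exactly one target class
suspects = [c for c in df.columns.drop("target")
            if (df.groupby(c)["target"].nunique() == 1).all()]
print(suspects)  # ['leaky']
```

High-cardinality columns trip this check trivially (an ID column always "predicts" the target), so flagged features still need the "available at prediction time?" question.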
8. 🧹 Outlier Detection
What I do:
I look for extreme values in both features and the target (especially for regression).
Why I do it:
Outliers can distort the loss function and bias the model.
How I check:
- Using the IQR method:
Q1 = df[numeric_cols].quantile(0.25)
Q3 = df[numeric_cols].quantile(0.75)
IQR = Q3 - Q1
outliers = ((df[numeric_cols] < Q1 - 1.5 * IQR) | (df[numeric_cols] > Q3 + 1.5 * IQR)).sum()
- Or z-score:
from scipy.stats import zscore

z_scores = np.abs(zscore(df[numeric_cols]))
9. 🧮 Multicollinearity Check (Regression Only)
What I do:
I measure how much a feature is linearly dependent on others using VIF (Variance Inflation Factor).
Why I do it:
High multicollinearity can inflate model variance and confuse interpretation.
How I check:
from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[numeric_cols].drop('target', axis=1)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
If VIF > 10, I consider dropping or combining the feature.
10. 🧪 Quick Predictive Sanity Check
What I do:
I train a basic model to test if the data contains usable signal.
Why I do it:
This gives me an early sense of whether I'm headed in the right direction — and whether something (like leakage) is inflating performance.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

score = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print("Baseline CV Accuracy:", score.mean())
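The regression counterpart swaps in a regressor; `cross_val_score` then scores with R² by default. A self-contained sketch on synthetic data (the generated `X`, `y` stand in for a real cleaned dataset):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic regression data with genuine linear signal plus noise
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=300)

# Default scoring for regressors is R²
scores = cross_val_score(RandomForestRegressor(random_state=42), X, y, cv=5)
print("Baseline CV R²:", scores.mean())
```

If this baseline scores near perfectly on real data, suspect leakage before celebrating; if it scores near zero, revisit the feature-target relationships from the bivariate step.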
11. ✍️ Documenting My Findings
What I do:
I summarize everything I discovered in markdown cells or a separate report.
Why I do it:
Clean documentation makes it easy to revisit decisions, collaborate with teams, and debug issues later.
I include:
- Data quality issues
- Outliers and leakage risks
- Key patterns and features
- Imbalance concerns
- Encoding or transformation needs
- Next steps for modeling
12. 🚀 Deliverables Before Modeling
Before moving to feature engineering or modeling, I make sure to finalize:
| Output File | Purpose |
| --- | --- |
| eda_notebook.ipynb | Full EDA with plots, markdown, decisions |
| eda_report.md/pdf | Executive summary |
| cleaned_data.csv | Cleaned dataset with transformations |
| outliers.csv | Optional: file with flagged outliers |