Behind the Models: A breakdown of my end-to-end approach to building data science solutions

At-a-Glance 👀

How I tackle data science projects step-by-step using CRISP-DM.

My EDA Process (Regression & Classification)

1. 📦 Understanding the Dataset

What I do:

I start by loading the dataset and exploring its overall structure. I look at the number of rows and columns, check the column names, and inspect the data types.

Why I do it:

This helps me understand the scale of the data, identify which columns are numerical, categorical, datetime, or possibly IDs, and spot early signs of formatting issues.

Steps I take:

df.shape
df.info()
df.head()
df.tail()
df.describe(include='all')
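Beyond running these commands, I split the columns by type so later steps know which ones to treat as numeric versus categorical. A minimal sketch on an invented toy frame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({
    "age": [25, 32, 41],
    "city": ["NY", "LA", "SF"],
    "signup": ["2021-01-01", "2021-06-15", "2022-03-02"],
})

print(df.shape)  # (3, 3)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
object_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # ['age']
print(object_cols)   # ['city', 'signup'] -- 'signup' should later become datetime
```

Object columns that are really dates (like `signup` here) are exactly the formatting issues this step is meant to surface.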

2. 🎯 Exploring the Target Variable

For classification:

What I do:

I check the class distribution to detect imbalance.

Why I do it:

Class imbalance can distort accuracy and may call for techniques like stratified sampling, SMOTE, or undersampling.

df['target'].value_counts(normalize=True).plot(kind='bar')
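To make the imbalance concrete, I also compute the ratio between the majority and minority class. A toy sketch (the 90/10 split is invented):

```python
import pandas as pd

# Invented target with a 9:1 class imbalance
y = pd.Series([0] * 90 + [1] * 10, name="target")

dist = y.value_counts(normalize=True)   # class proportions
print(dist)
imbalance_ratio = dist.max() / dist.min()
print(round(imbalance_ratio))           # 9
```

A ratio this high is a signal to use stratified splits and to look past plain accuracy as a metric.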

For regression:

What I do:

I visualize the target's distribution and check for skewness or extreme values.

Why I do it:

Understanding if the target is normally distributed helps me decide whether to apply transformations like log or Box-Cox.

sns.histplot(df['target'], kde=True)
sns.boxplot(x=df['target'])
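When the histogram shows heavy right skew, a log transform is usually my first candidate. A small sketch on invented values showing how `np.log1p` pulls the skewness down:

```python
import numpy as np
import pandas as pd

# Invented right-skewed target
target = pd.Series([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])

log_target = np.log1p(target)  # log(1 + x); safe at zero
print(target.skew(), log_target.skew())  # skewness drops sharply after the transform
```

I use `log1p` rather than `log` so that zero values don't blow up; Box-Cox is an alternative when the target is strictly positive.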

3. 🔍 Auditing Data Quality

What I do:

I identify missing values, duplicated rows, or incorrect data types.

Why I do it:

Missing or malformed data can break models or lead to misleading results.

Steps I take:

df.isnull().sum().sort_values(ascending=False)
df.duplicated().sum()
df.duplicated(keep=False)  # keep=False (the boolean, not the string) flags every copy of a duplicate
df.dtypes

If I see object columns that should be dates or numerics, I convert them using:

df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')

3.5 Handling Missing Values

What I do:

I identify which columns have missing values, determine why they’re missing, and decide whether to drop, fill (impute), or leave them as-is.

Step 1 — Check for Missingness

df.isnull().sum().sort_values(ascending=False)
(df.isnull().sum() / len(df)) * 100  # % missing per column
sns.heatmap(df.isnull(), cbar=False)

If a column has >70% missing values, it often provides little information and can be dropped, unless it’s critical for business use.
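That drop rule is easy to express with `isnull().mean()`. A minimal sketch on an invented frame (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],  # 80% missing
    "complete": range(10),
})

missing_pct = df.isnull().mean()                  # fraction missing per column
to_drop = missing_pct[missing_pct > 0.70].index
df_reduced = df.drop(columns=to_drop)
print(list(df_reduced.columns))  # ['complete']
```

I still review the dropped list against business needs before committing, as the text above warns.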

Step 2 — Understand Why Data Is Missing

| Type | Meaning | Example | Common Fix |
|---|---|---|---|
| MCAR – Missing Completely At Random | No pattern in missingness | Random survey errors | Simple imputation (mean/median/mode) |
| MAR – Missing At Random | Missingness depends on other columns | Income missing more often for younger people | Group-based imputation |
| MNAR – Missing Not At Random | Missingness depends on its own value | People with high debt skip “Debt” question | Treat separately or model missingness as a feature |

Step 3 — Check Dependence on Other Columns

If missing values in one column depend on another feature, I use grouped imputation based on business logic.

Example:

df['Income'] = df.groupby('Occupation')['Income'].transform(
    lambda x: x.fillna(x.median())
)

This imputes missing Income values using the median income for each occupation — more realistic than a global fill.

Step 4 — Choose the Right Imputation Method

| Data Type | Simple Imputation | Advanced Imputation |
|---|---|---|
| Numerical | Mean, Median | Regression, KNNImputer |
| Categorical | Mode, “Unknown” | Group-based mode, OneHot with NaN |
| Time Series | Forward/Backward fill | Interpolation |
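For the time-series row, the difference between the fill strategies is easiest to see side by side. A sketch on an invented series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.ffill().tolist())        # [1.0, 1.0, 3.0, 3.0, 5.0]
print(s.bfill().tolist())        # [1.0, 3.0, 3.0, 5.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Interpolation assumes the series changes smoothly between observations; forward fill assumes the last known value persists, which fits sensor-style data better.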

📊 Summary

| Situation | Action |
|---|---|
| >70% missing in a column | Drop column |
| Missingness depends on another variable | Group imputation using related column |
| Random missing values | Mean/median/mode imputation |
| Categorical missing values | Mode or “Unknown” |
| Time series data | Forward/backward fill |

4. Univariate Feature Exploration

What I do:

I analyze each feature individually.

Why I do it:

This helps me understand distributions, detect weird patterns, and spot constant or low-variance columns.

For numeric features:

df.select_dtypes(include=['int', 'float']).hist(figsize=(15,10))

For categorical features:

cat_cols = df.select_dtypes(include='object').columns  # object columns treated as categorical
for col in cat_cols:
    print(df[col].value_counts())
    sns.countplot(y=col, data=df)

5. Bivariate Analysis: Feature–Target Relationships

What I do:

I study how each feature relates to the target variable.

Why I do it:

It tells me which features might be useful predictors, and helps detect data leakage.

For classification:

  • Numerical → Target:
sns.boxplot(x='target', y='feature', data=df)
  • Categorical → Target:
pd.crosstab(df['feature'], df['target'], normalize='index').plot(kind='bar', stacked=True)

For regression:

  • Numerical → Target:
sns.scatterplot(x='feature', y='target', data=df)
  • Categorical → Target:
df.groupby('feature')['target'].mean().plot(kind='bar')

6. Multivariate Relationships Between Features

What I do:

I analyze how features relate to each other.

Why I do it:

This helps me identify redundant features and potential multicollinearity, especially important in regression.

For numeric–numeric:

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

For categorical–categorical:

I calculate Cramér’s V:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))
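As a sanity check, the statistic should hit 1 for a perfectly associated pair and 0 for independence. A quick sketch on invented data (the function is restated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))

# Perfectly associated 3-level pair: each level of x maps to exactly one level of y
x = pd.Series(["a"] * 3 + ["b"] * 3 + ["c"] * 3)
y = pd.Series(["p"] * 3 + ["q"] * 3 + ["r"] * 3)
print(round(cramers_v(x, y), 3))  # 1.0
```

Note that for 2×2 tables `chi2_contingency` applies Yates' correction by default, which shrinks the statistic; pass `correction=False` there if you want the uncorrected value.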

7. Checking for Data Leakage

What I do:

I test whether any feature is too strongly tied to the target — possibly because it was created after the fact.

Why I do it:

Data leakage inflates accuracy unrealistically and causes models to fail in production.

How I check:

  • I look for features where every unique value maps to one target class:
df.groupby('feature')['target'].nunique()
  • I check Cramér’s V or correlation with the target — if it’s ~1, that’s a red flag.
  • I ask: “Would I have access to this information at prediction time?”

8. Outlier Detection

What I do:

I look for extreme values in both features and the target (especially for regression).

Why I do it:

Outliers can distort the loss function and bias the model.

How I check:

  • Using the IQR method:
num = df.select_dtypes(include='number')  # IQR only applies to numeric columns
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
outliers = ((num < Q1 - 1.5*IQR) | (num > Q3 + 1.5*IQR)).sum()
  • Or z-score:
import numpy as np
from scipy.stats import zscore
z_scores = np.abs(zscore(df[numeric_cols]))  # flag rows where any |z| > 3
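Once outliers are flagged, I often cap them at the IQR fences (winsorizing) rather than dropping rows, so no data is lost. A sketch on an invented series with one extreme value:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # invented feature with one extreme value

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

capped = s.clip(lower, upper)  # cap values at the IQR fences instead of removing rows
print(capped.tolist())
```

Capping preserves the row count and the ordering of values; dropping is better when the extreme value is a data-entry error rather than a real observation.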

9. Multicollinearity Check (Regression Only)

What I do:

I measure how much a feature is linearly dependent on others using VIF (Variance Inflation Factor).

Why I do it:

High multicollinearity can inflate model variance and confuse interpretation.

How I check:

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[numeric_cols].drop('target', axis=1)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

If VIF > 10, I consider dropping or combining the feature.

10. Quick Predictive Sanity Check

What I do:

I train a basic model to test if the data contains usable signal.

Why I do it:

This gives me an early sense of whether I'm headed in the right direction — and whether something (like leakage) is inflating performance.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.drop('target', axis=1)  # assumes features are already numeric/encoded
y = df['target']
score = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print("Baseline CV Accuracy:", score.mean())
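For regression problems I run the same sanity check with a regressor and an R² score. A self-contained sketch on synthetic data with a known signal (the data is invented purely so the snippet runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: y depends strongly on the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5, scoring="r2")
print("Baseline CV R^2:", scores.mean())
```

If a baseline like this scores near-perfectly on real data, I go back to the leakage checks in step 7 before celebrating.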

11. Documenting My Findings

What I do:

I summarize everything I discovered in markdown cells or a separate report.

Why I do it:

Clean documentation makes it easy to revisit decisions, collaborate with teams, and debug issues later.

I include:

  • Data quality issues
  • Outliers and leakage risks
  • Key patterns and features
  • Imbalance concerns
  • Encoding or transformation needs
  • Next steps for modeling

12. Deliverables Before Modeling

Before moving to feature engineering or modeling, I make sure to finalize:

| Output File | Purpose |
|---|---|
| eda_notebook.ipynb | Full EDA with plots, markdown, decisions |
| eda_report.md/pdf | Executive summary |
| cleaned_data.csv | Cleaned dataset with transformations |
| outliers.csv | Optional: file with flagged outliers |