Behind the Models: A breakdown of my end-to-end approach to building data science solutions

At-a-Glance 👀

How I tackle data science projects step-by-step using CRISP-DM.

My EDA Process (Regression & Classification)

1. 📦 Understanding the Dataset

What I do:

I start by loading the dataset and exploring its overall structure. I look at the number of rows and columns, check the column names, and inspect the data types.

Why I do it:

This helps me understand the scale of the data, identify which columns are numerical, categorical, datetime, or possibly IDs, and spot early signs of formatting issues.

Steps I take:

df.shape
df.info()
df.head()
df.tail()
df.describe(include='all')
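Beyond running these commands, I split the columns by type so later steps know which ones to treat as numeric versus categorical. A minimal sketch on an invented toy frame (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset
df = pd.DataFrame({
    "age": [25, 32, 41],
    "city": ["NY", "LA", "SF"],
    "signup": ["2021-01-01", "2021-06-15", "2022-03-02"],
})

print(df.shape)  # (3, 3)
numeric_cols = df.select_dtypes(include="number").columns.tolist()
object_cols = df.select_dtypes(include="object").columns.tolist()
print(numeric_cols)  # ['age']
print(object_cols)   # ['city', 'signup'] -- 'signup' should later become datetime
```

Object columns that are really dates (like `signup` here) are exactly the formatting issues this step is meant to surface.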

2. 🎯 Exploring the Target Variable

For classification:

What I do:

I check the class distribution to detect imbalance.

Why I do it:

Class imbalance can distort accuracy and may call for techniques like stratified sampling, SMOTE, or undersampling.

df['target'].value_counts(normalize=True).plot(kind='bar')
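To make the imbalance concrete, I also compute the ratio between the majority and minority class. A toy sketch (the 90/10 split is invented):

```python
import pandas as pd

# Invented target with a 9:1 class imbalance
y = pd.Series([0] * 90 + [1] * 10, name="target")

dist = y.value_counts(normalize=True)   # class proportions
print(dist)
imbalance_ratio = dist.max() / dist.min()
print(round(imbalance_ratio))           # 9
```

A ratio this high is a signal to use stratified splits and to look past plain accuracy as a metric.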

For regression:

What I do:

I visualize the target's distribution and check for skewness or extreme values.

Why I do it:

Understanding if the target is normally distributed helps me decide whether to apply transformations like log or Box-Cox.

sns.histplot(df['target'], kde=True)
sns.boxplot(x=df['target'])
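When the histogram shows heavy right skew, a log transform is usually my first candidate. A small sketch on invented values showing how `np.log1p` pulls the skewness down:

```python
import numpy as np
import pandas as pd

# Invented right-skewed target
target = pd.Series([1.0, 2.0, 3.0, 10.0, 100.0, 1000.0])

log_target = np.log1p(target)  # log(1 + x); safe at zero
print(target.skew(), log_target.skew())  # skewness drops sharply after the transform
```

I use `log1p` rather than `log` so that zero values don't blow up; Box-Cox is an alternative when the target is strictly positive.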

3. 🔍 Auditing Data Quality

What I do:

I identify missing values, duplicated rows, or incorrect data types.

Why I do it:

Missing or malformed data can break models or lead to misleading results.

Steps I take:

df.isnull().sum().sort_values(ascending=False)
df.duplicated().sum()
df.duplicated(keep=False)  # keep=False (the boolean, not the string) flags every copy of a duplicate
df.dtypes

If I see object columns that should be dates or numerics, I convert them using:

df['date_col'] = pd.to_datetime(df['date_col'], errors='coerce')

3.5 Handling Missing Values

What I do:

I identify which columns have missing values, determine why they’re missing, and decide whether to drop, fill (impute), or leave them as-is.

Step 1 — Check for Missingness

df.isnull().sum().sort_values(ascending=False)
(df.isnull().sum() / len(df)) * 100  # % missing per column
sns.heatmap(df.isnull(), cbar=False)

If a column has >70% missing values, it often provides little information and can be dropped, unless it’s critical for business use.
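That drop rule is easy to express with `isnull().mean()`. A minimal sketch on an invented frame (column names are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "mostly_missing": [np.nan] * 8 + [1.0, 2.0],  # 80% missing
    "complete": range(10),
})

missing_pct = df.isnull().mean()                  # fraction missing per column
to_drop = missing_pct[missing_pct > 0.70].index
df_reduced = df.drop(columns=to_drop)
print(list(df_reduced.columns))  # ['complete']
```

I still review the dropped list against business needs before committing, as the text above warns.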

Step 2 — Understand Why Data Is Missing

| Type | Meaning | Example | Common Fix |
|---|---|---|---|
| MCAR – Missing Completely At Random | No pattern in missingness | Random survey errors | Simple imputation (mean/median/mode) |
| MAR – Missing At Random | Missingness depends on other columns | Income missing more often for younger people | Group-based imputation |
| MNAR – Missing Not At Random | Missingness depends on its own value | People with high debt skip “Debt” question | Treat separately or model missingness as a feature |

Step 3 — Check Dependence on Other Columns

If missing values in one column depend on another feature, I use grouped imputation based on business logic.

Example:

df['Income'] = df.groupby('Occupation')['Income'].transform(
    lambda x: x.fillna(x.median())
)

This imputes missing Income values using the median income for each occupation — more realistic than a global fill.

Step 4 — Choose the Right Imputation Method

| Data Type | Simple Imputation | Advanced Imputation |
|---|---|---|
| Numerical | Mean, Median | Regression, KNNImputer |
| Categorical | Mode, “Unknown” | Group-based mode, OneHot with NaN |
| Time Series | Forward/Backward fill | Interpolation |
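For the time-series row, the difference between the fill strategies is easiest to see side by side. A sketch on an invented series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.ffill().tolist())        # [1.0, 1.0, 3.0, 3.0, 5.0]
print(s.bfill().tolist())        # [1.0, 3.0, 3.0, 5.0, 5.0]
print(s.interpolate().tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

Interpolation assumes the series changes smoothly between observations; forward fill assumes the last known value persists, which fits sensor-style data better.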

📊 Summary

| Situation | Action |
|---|---|
| >70% missing in a column | Drop column |
| Missingness depends on another variable | Group imputation using related column |
| Random missing values | Mean/median/mode imputation |
| Categorical missing values | Mode or “Unknown” |
| Time series data | Forward/backward fill |

4. Univariate Feature Exploration

What I do:

I analyze each feature individually.

Why I do it:

This helps me understand distributions, detect weird patterns, and spot constant or low-variance columns.

For numeric features:

df.select_dtypes(include=['int', 'float']).hist(figsize=(15,10))

For categorical features:

cat_cols = df.select_dtypes(include='object').columns  # object columns treated as categorical
for col in cat_cols:
    print(df[col].value_counts())
    sns.countplot(y=col, data=df)

5. Bivariate Analysis: Feature–Target Relationships

What I do:

I study how each feature relates to the target variable.

Why I do it:

It tells me which features might be useful predictors, and helps detect data leakage.

For classification:

  • Numerical → Target:
sns.boxplot(x='target', y='feature', data=df)
  • Categorical → Target:
pd.crosstab(df['feature'], df['target'], normalize='index').plot(kind='bar', stacked=True)

For regression:

  • Numerical → Target:
sns.scatterplot(x='feature', y='target', data=df)
  • Categorical → Target:
df.groupby('feature')['target'].mean().plot(kind='bar')

6. Multivariate Relationships Between Features

What I do:

I analyze how features relate to each other.

Why I do it:

This helps me identify redundant features and potential multicollinearity, especially important in regression.

For numeric–numeric:

sns.heatmap(df.corr(), annot=True, cmap='coolwarm')

For categorical–categorical:

I calculate Cramér’s V:

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))
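As a sanity check, the statistic should hit 1 for a perfectly associated pair and 0 for independence. A quick sketch on invented data (the function is restated so the snippet runs on its own):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x, y):
    confusion = pd.crosstab(x, y)
    chi2 = chi2_contingency(confusion)[0]
    n = confusion.sum().sum()
    return np.sqrt(chi2 / (n * (min(confusion.shape) - 1)))

# Perfectly associated 3-level pair: each level of x maps to exactly one level of y
x = pd.Series(["a"] * 3 + ["b"] * 3 + ["c"] * 3)
y = pd.Series(["p"] * 3 + ["q"] * 3 + ["r"] * 3)
print(round(cramers_v(x, y), 3))  # 1.0
```

Note that for 2×2 tables `chi2_contingency` applies Yates' correction by default, which shrinks the statistic; pass `correction=False` there if you want the uncorrected value.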

7. Checking for Data Leakage

What I do:

I test whether any feature is too strongly tied to the target — possibly because it was created after the fact.

Why I do it:

Data leakage inflates accuracy unrealistically and causes models to fail in production.

How I check:

  • I look for features where every unique value maps to one target class:
df.groupby('feature')['target'].nunique()
  • I check Cramér’s V or correlation with the target — if it’s ~1, that’s a red flag.
  • I ask: “Would I have access to this information at prediction time?”

8. Outlier Detection

What I do:

I look for extreme values in both features and the target (especially for regression).

Why I do it:

Outliers can distort the loss function and bias the model.

How I check:

  • Using the IQR method:
num = df.select_dtypes(include='number')  # IQR only applies to numeric columns
Q1 = num.quantile(0.25)
Q3 = num.quantile(0.75)
IQR = Q3 - Q1
outliers = ((num < Q1 - 1.5*IQR) | (num > Q3 + 1.5*IQR)).sum()
  • Or z-score:
import numpy as np
from scipy.stats import zscore
z_scores = np.abs(zscore(df[numeric_cols]))  # flag rows where any |z| > 3
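Once outliers are flagged, I often cap them at the IQR fences (winsorizing) rather than dropping rows, so no data is lost. A sketch on an invented series with one extreme value:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4, 100])  # invented feature with one extreme value

Q1, Q3 = s.quantile(0.25), s.quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR

capped = s.clip(lower, upper)  # cap values at the IQR fences instead of removing rows
print(capped.tolist())
```

Capping preserves the row count and the ordering of values; dropping is better when the extreme value is a data-entry error rather than a real observation.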

9. Multicollinearity Check (Regression Only)

What I do:

I measure how much a feature is linearly dependent on others using VIF (Variance Inflation Factor).

Why I do it:

High multicollinearity can inflate model variance and confuse interpretation.

How I check:

from statsmodels.stats.outliers_influence import variance_inflation_factor

X = df[numeric_cols].drop('target', axis=1)
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]

If VIF > 10, I consider dropping or combining the feature.

10. Quick Predictive Sanity Check

What I do:

I train a basic model to test if the data contains usable signal.

Why I do it:

This gives me an early sense of whether I'm headed in the right direction — and whether something (like leakage) is inflating performance.

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.drop('target', axis=1)  # assumes features are already numeric/encoded
y = df['target']
score = cross_val_score(RandomForestClassifier(), X, y, cv=5)
print("Baseline CV Accuracy:", score.mean())
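For regression problems I run the same sanity check with a regressor and an R² score. A self-contained sketch on synthetic data with a known signal (the data is invented purely so the snippet runs):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic data: y depends strongly on the first feature
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

scores = cross_val_score(RandomForestRegressor(random_state=0), X, y, cv=5, scoring="r2")
print("Baseline CV R^2:", scores.mean())
```

If a baseline like this scores near-perfectly on real data, I go back to the leakage checks in step 7 before celebrating.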

11. Documenting My Findings

What I do:

I summarize everything I discovered in markdown cells or a separate report.

Why I do it:

Clean documentation makes it easy to revisit decisions, collaborate with teams, and debug issues later.

I include:

  • Data quality issues
  • Outliers and leakage risks
  • Key patterns and features
  • Imbalance concerns
  • Encoding or transformation needs
  • Next steps for modeling

12. Deliverables Before Modeling

Before moving to feature engineering or modeling, I make sure to finalize:

| Output File | Purpose |
|---|---|
| eda_notebook.ipynb | Full EDA with plots, markdown, decisions |
| eda_report.md/pdf | Executive summary |
| cleaned_data.csv | Cleaned dataset with transformations |
| outliers.csv | Optional: file with flagged outliers |