End-to-End Data Platform

Built and maintain an end-to-end, incremental batch data platform on Azure, running in production, as the sole data engineer: ADF ingestion, a Databricks medallion pipeline, and Power BI dashboards for live auction analytics, all governed by Unity Catalog.

1. How It Works

Raw files (CSV, JSON, Parquet) land in Azure Data Lake Storage Gen2 in dated folders. A scheduled Databricks Control Job checks a batch control table, picks the oldest unprocessed folder, and kicks off processing through three layers: Bronze ingests raw files as-is with a batch ID and timestamp; Silver cleans, deduplicates, and conforms the data; Gold produces business-ready aggregates and dimensional models for reporting.
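
The "pick the oldest unprocessed folder" step can be sketched in plain Python. This is only an illustration: the in-memory dict stands in for the batch control table, and the ISO-date folder naming (YYYY-MM-DD) and the function names are assumptions, not the platform's actual layout.

```python
from datetime import date, datetime
from typing import Optional

# Hypothetical snapshot of the batch control table: folder name -> status.
# Folder names are assumed to be ISO dates; the real naming may differ.
control_table = {
    "2024-03-01": "COMPLETED",
    "2024-03-03": "PENDING",
    "2024-03-02": "PENDING",
}


def parse_folder_date(folder: str) -> date:
    """Parse a dated folder name; raises ValueError on unexpected names."""
    return datetime.strptime(folder, "%Y-%m-%d").date()


def next_batch(control: dict[str, str]) -> Optional[str]:
    """Return the oldest folder still marked PENDING, or None if none is waiting."""
    pending = [folder for folder, status in control.items() if status == "PENDING"]
    return min(pending, key=parse_folder_date, default=None)


print(next_batch(control_table))  # 2024-03-02
```

Processing folders strictly oldest-first keeps downstream layers in arrival order, which matters once Silver starts deduplicating on "latest record wins".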

Every batch moves through a tracked lifecycle (PENDING → IN_PROGRESS → COMPLETED / FAILED), which makes the pipeline idempotent and safely retriable: a completed batch is never picked up again, and a failed one can be rerun. Curated data is served to Power BI via a Databricks SQL endpoint, with access governed by Unity Catalog RBAC.
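
A minimal sketch of that lifecycle as a small state machine. The four status names come from the text above; the FAILED → PENDING retry edge and the helper names are assumptions made for illustration.

```python
from enum import Enum


class BatchStatus(Enum):
    PENDING = "PENDING"
    IN_PROGRESS = "IN_PROGRESS"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Allowed moves between states. The retry edge (FAILED -> PENDING) is an
# assumption; the source only says failed batches are safely retriable.
ALLOWED_TRANSITIONS = {
    BatchStatus.PENDING: {BatchStatus.IN_PROGRESS},
    BatchStatus.IN_PROGRESS: {BatchStatus.COMPLETED, BatchStatus.FAILED},
    BatchStatus.FAILED: {BatchStatus.PENDING},
    BatchStatus.COMPLETED: set(),
}


def transition(current: BatchStatus, target: BatchStatus) -> BatchStatus:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


def should_process(status: BatchStatus) -> bool:
    """Only PENDING batches are eligible, so a rerun skips finished work."""
    return status is BatchStatus.PENDING


status = transition(BatchStatus.PENDING, BatchStatus.IN_PROGRESS)
status = transition(status, BatchStatus.COMPLETED)
print(status.value, should_process(status))  # COMPLETED False
```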

2. Unity Catalog Hierarchy & Databricks Connectivity

How Azure storage is connected to Databricks through managed identity, external locations, and Unity Catalog - mapping ADLS Gen2 containers to landing, bronze, silver, gold, and batch control schemas.
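
As a rough illustration of that mapping, the helper below builds the abfss:// URL for each layer and the matching CREATE EXTERNAL LOCATION statement. The storage account name, credential name, location names, and the one-container-per-schema layout are all hypothetical; only the URL scheme and the SQL statement shape are standard.

```python
# Schemas named in the text; everything else below is a placeholder.
SCHEMAS = ["landing", "bronze", "silver", "gold", "batch_control"]


def container_for(schema: str) -> str:
    """Azure container names cannot contain underscores, so swap them for hyphens."""
    return schema.replace("_", "-")


def abfss_url(container: str, account: str, path: str = "") -> str:
    """Build an ADLS Gen2 URL of the form abfss://<container>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def external_location_sql(schema: str, account: str, credential: str) -> str:
    """Render a Unity Catalog external location statement for one schema."""
    url = abfss_url(container_for(schema), account)
    return (
        f"CREATE EXTERNAL LOCATION IF NOT EXISTS `{schema}_loc` "
        f"URL '{url}' WITH (STORAGE CREDENTIAL `{credential}`)"
    )


# "mystorageacct" and "platform_mi_credential" are made-up names.
for schema in SCHEMAS:
    print(external_location_sql(schema, "mystorageacct", "platform_mi_credential"))
```

The storage credential referenced here would be the one backed by the managed identity; the statements themselves would be run once from a Databricks SQL editor or notebook.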

3. Medallion Architecture - Step-by-Step Breakdown

End-to-end view of the five pipeline stages: Landing → Bronze (raw ingestion) → Silver (cleansing and conforming) → Gold (curated marts and KPIs) → Consumption via Power BI, ML notebooks, and APIs.

4. Control / Batch Table Orchestration

How the pipeline discovers, locks, and tracks each batch using the batch_control schema, covering the full lifecycle from folder discovery through PENDING → IN_PROGRESS → COMPLETED / FAILED, with a full audit trail.

5. Processing Pipeline - Technical Breakdown

Detailed step-by-step walkthrough of a single batch flowing through all five stages, including schema enforcement, deduplication, SCD upserts, Gold mart assembly, and downstream consumption with audit closure.
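
The deduplication and SCD upsert steps can be illustrated without Spark. The sketch below assumes SCD Type 2 (the text does not say which SCD type is used) and uses made-up auction-lot records; in the real pipeline this logic would more likely be a Delta MERGE.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DimRow:
    key: str
    attrs: dict
    valid_from: str
    valid_to: Optional[str] = None  # None marks the current version


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the latest record per key; ISO timestamps compare chronologically as strings."""
    latest: dict[str, dict] = {}
    for rec in records:
        key = rec["key"]
        if key not in latest or rec["updated_at"] > latest[key]["updated_at"]:
            latest[key] = rec
    return list(latest.values())


def scd2_upsert(dim: list[DimRow], incoming: list[dict]) -> list[DimRow]:
    """Close the current version when attributes change, then append the new one."""
    current = {row.key: row for row in dim if row.valid_to is None}
    for rec in incoming:
        attrs = {k: v for k, v in rec.items() if k not in ("key", "updated_at")}
        existing = current.get(rec["key"])
        if existing is not None and existing.attrs == attrs:
            continue  # unchanged: replaying the same batch adds nothing
        if existing is not None:
            existing.valid_to = rec["updated_at"]
        new_row = DimRow(rec["key"], attrs, valid_from=rec["updated_at"])
        dim.append(new_row)
        current[rec["key"]] = new_row
    return dim


# Hypothetical sample batch containing one duplicate key.
batch = [
    {"key": "lot-1", "price": 100, "updated_at": "2024-03-02T10:00:00"},
    {"key": "lot-1", "price": 120, "updated_at": "2024-03-02T11:00:00"},
    {"key": "lot-2", "price": 50, "updated_at": "2024-03-02T09:00:00"},
]
dim: list[DimRow] = []
scd2_upsert(dim, deduplicate(batch))
scd2_upsert(dim, deduplicate(batch))  # replaying the batch changes nothing
print(len(dim))  # 2
```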

6. Power BI Consumption - End-to-End Architecture

How curated Gold tables are exposed to Power BI via Databricks SQL Warehouse, secured through Unity Catalog ACLs and Entra ID authentication over a private endpoint.

7. Azure Data Factory Ingestion (Future Implementation)

Planned ADF pipeline to automate ingestion from Azure SQL Database and REST APIs into the ADLS Gen2 landing zone using parameterized Copy Activities triggered on a schedule.

8. Databricks Jobs - Batch Control & Processing Workflow

The two Databricks jobs that power orchestration: the Control Job identifies and registers the next batch, then triggers the Processing Job to run Bronze → Silver transformations.
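
The hand-off from the Control Job to the Processing Job could be expressed as a call to the Databricks Jobs API run-now endpoint. This is a sketch under assumptions: the workspace URL, token, job ID, and the batch_id job parameter are placeholders, and the actual platform may instead chain the jobs with a built-in "Run Job" task. The request is only built here, not sent.

```python
import json
import urllib.request


def build_run_now_request(
    host: str, token: str, job_id: int, batch_id: str
) -> urllib.request.Request:
    """Prepare (but do not send) a Jobs API 2.1 run-now call for one batch."""
    # "batch_id" is a hypothetical job-level parameter on the Processing Job.
    payload = {"job_id": job_id, "job_parameters": {"batch_id": batch_id}}
    return urllib.request.Request(
        url=f"{host.rstrip('/')}/api/2.1/jobs/run-now",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder workspace URL, token, and job ID.
req = build_run_now_request(
    "https://example.azuredatabricks.net", "<token>", 123, "2024-03-02"
)
print(req.get_method(), req.full_url)
# Sending it would be: urllib.request.urlopen(req)
```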

Azure Databricks Incremental Batch Pipeline