End-to-end Data Engineering pipeline built on Azure, processing retail sales data from raw ingestion to analytical dashboard using a Bronze → Silver → Gold architecture.
Azure SQL DB ──┐
├──► Azure Data Factory ──► ADLS Gen2 (Bronze) ──► Databricks (Silver) ──► Databricks (Gold) ──► Power BI
GitHub API ────┘
| Layer | Description |
|---|---|
| Bronze | Raw Parquet files ingested as-is from sources |
| Silver | Cleaned, typed, joined, enriched data — Delta format |
| Gold | Business aggregations ready for reporting |
- Azure SQL Database — Source for Products, Stores, Transactions
- GitHub API (JSON) — Source for Customers data
- Azure Data Factory — Orchestration & ingestion (4 Copy Activities)
- Azure Data Lake Storage Gen2 — Medallion storage (
hkhiar-datalake) - Databricks Community Edition — PySpark transformations (Silver & Gold)
- Delta Lake — Silver layer storage format
- Power BI Desktop — Final dashboard
├── notebooks/
│ ├── ADLS_Bronze.py # Load Bronze → clean → write Silver (Delta)
│ └── ADLS_Silver.py # Load Silver → aggregate → write Gold
├── dashboard/
│ └── SalesAnalysis.pbix # Power BI dashboard
└── README.md
- Azure SQL tables:
Products,Stores,Transactions - External JSON source via GitHub API:
Customers
4 Copy Activities copying data to ADLS Gen2 in Parquet format:
SQL → bronze/transaction/
SQL → bronze/product/
SQL → bronze/store/
API → bronze/customer/
- Type casting (
IntegerType,FloatType,DateType) - Deduplication & null handling
- 4-way join: transactions ⋈ products ⋈ stores ⋈ customers
- Computed column:
total_amount = quantity × price - Written as Delta to
Silver/
Aggregations by product_id, product_name, category, transaction_date:
total_quantity— units soldtotal_revenue— revenue (DH)
Gold exported to CSV → imported into Power BI with a generated Date Table and Measure Table.
Sales Analysis — Office Supplies
| KPI | Value |
|---|---|
| Total Sales | 35.26K DH |
| Total Quantity | 101 units |
| Total Products | 8 |
| Total Categories | 4 |
Visuals: Revenue by Date · Quantity by Date · Sales by Category (pie) · Sales by Product (100% bar) · Quantity by Category (donut)
Total Sales = SUM(gold_layer_transformations[total_revenue])
Total Quantity = SUM(gold_layer_transformations[total_quantity])
Total Products = DISTINCTCOUNT(gold_layer_transformations[product_name])
Total Categories = DISTINCTCOUNT(gold_layer_transformations[category])
Date Table =
VAR MinDate = MIN(gold_layer_transformations[transaction_date])
VAR MaxDate = MAX(gold_layer_transformations[transaction_date])
RETURN
ADDCOLUMNS(
CALENDAR(MinDate, MaxDate),
"Year", YEAR([Date]),
"MonthNum", MONTH([Date]),
"MonthName", FORMAT([Date], "MMMM")
)
Storage account key managed via Databricks secrets:
storage_account_key = dbutils.secrets.get(scope="azure_access_key", key="storage_key")Paths use abfss:// protocol — no mount points:
base_path = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net/"- Provision Azure SQL DB and insert source data
- Create ADLS Gen2 storage account with Hierarchical Namespace enabled
- Set up Databricks secret scope with your storage key
- Trigger ADF pipeline — verify Parquet files in Bronze
- Run
ADLS_Bronze.pynotebook — Silver Delta table is written - Run
ADLS_Silver.pynotebook — Gold aggregations are written - Export Gold as CSV and open
SalesAnalysis.pbixin Power BI Desktop