This repository showcases an end-to-end, cloud-native data engineering solution built on Google Cloud Platform (GCP).
It demonstrates how enterprise SAP and operational data can be ingested, transformed, governed, and delivered as executive-ready analytics.
Key objectives:
- Translate complex business requirements into scalable cloud data pipelines
- Ingest SAP Finance and operational datasets into a unified analytics platform
- Optimize query performance and cloud costs
- Enable executive decision-making through curated dashboards
- Apply software engineering best practices to data pipelines
Flow:
SAP & Operational Sources
→ Cloud Storage
→ Python-based ETL ingestion
→ BigQuery (staging → unified facts)
→ Analytics & cost optimization queries
→ Looker / Data Studio dashboards
Orchestration is handled via Airflow (Cloud Composer).
- Handles source extraction and ingestion
- Loads data into BigQuery staging tables
- Includes logging and error handling
- Built on the Google Cloud SDK, following production-ready patterns
📄 etl/etl_ingestion.py
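A minimal sketch of the ingestion pattern described above. In production this script loads into BigQuery via the `google-cloud-bigquery` client; here `sqlite3` stands in for the staging layer so the example is self-contained, and the table and column names (`stg_sap_finance`, `doc_id`, `amount`) are illustrative assumptions, not the repo's actual schema.

```python
# Simplified ingestion sketch: sqlite3 stands in for BigQuery staging.
import csv
import io
import logging
import sqlite3

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("etl_ingestion")

def ingest_to_staging(conn, source_csv: str, table: str = "stg_sap_finance") -> int:
    """Parse CSV rows and load them into a staging table, logging failures."""
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (doc_id TEXT, amount REAL)")
    loaded = 0
    for row in csv.DictReader(io.StringIO(source_csv)):
        try:
            conn.execute(f"INSERT INTO {table} VALUES (?, ?)",
                         (row["doc_id"], float(row["amount"])))
            loaded += 1
        except (KeyError, ValueError) as exc:  # error handling: skip bad rows
            log.warning("Skipping malformed row %r: %s", row, exc)
    conn.commit()
    log.info("Loaded %d rows into %s", loaded, table)
    return loaded

sample = "doc_id,amount\nD001,100.50\nD002,not_a_number\nD003,75.00\n"
conn = sqlite3.connect(":memory:")
print(ingest_to_staging(conn, sample))  # 2 — the malformed row is skipped
```

The same shape (parse, validate per row, log and skip bad records, commit) carries over directly when the target is a BigQuery staging table.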
- Unifies SAP and operational data
- Normalizes schemas and business statuses
- Produces analytics-ready fact tables
📄 etl/etl_transformations.sql
Example logic:
- Multi-source union
- Business rule mapping
- Status normalization
- Source lineage tagging
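The union/normalize/tag logic above lives in SQL in the repo; the following Python sketch shows the same idea in miniature. The status codes and mapping are hypothetical placeholders for the actual business rules.

```python
# Hypothetical status mapping; the real business rules live in the SQL.
STATUS_MAP = {"01": "OPEN", "OP": "OPEN", "02": "CLEARED", "CL": "CLEARED"}

def unify(sap_rows, ops_rows):
    """Union two sources, normalize statuses, and tag each row's lineage."""
    unified = []
    for source, rows in (("sap", sap_rows), ("ops", ops_rows)):
        for row in rows:
            unified.append({
                **row,
                "status": STATUS_MAP.get(row["status"], "UNKNOWN"),  # normalization
                "source_system": source,                             # lineage tag
            })
    return unified

facts = unify([{"doc_id": "D1", "status": "01"}],
              [{"doc_id": "O1", "status": "CL"}])
print(facts[0]["status"], facts[1]["source_system"])  # OPEN ops
```

In the SQL version this corresponds to a `UNION ALL` of per-source SELECTs, a `CASE` expression for status, and a literal source column for lineage.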
- Managed via Airflow (Cloud Composer)
- Daily scheduled pipelines
- Clear separation of ingestion and transformation tasks
📄 orchestration/airflow_dag.py
This mirrors enterprise scheduling patterns used in production environments.
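As a rough illustration of that pattern, an Airflow DAG separating the two stages might look like the sketch below. The DAG id, task ids, schedule, and commands are all assumptions for illustration, not the repo's actual configuration, and running it requires an Airflow (Cloud Composer) environment.

```python
# Illustrative DAG config sketch: daily schedule, ingestion before transformation.
from datetime import datetime
from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="finance_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # daily scheduled pipeline
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_to_staging",
        bash_command="python etl/etl_ingestion.py",
    )
    transform = BashOperator(
        task_id="build_fact_tables",
        bash_command="bq query --use_legacy_sql=false < etl/etl_transformations.sql",
    )
    ingest >> transform  # clear separation of ingestion and transformation
```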
Includes:
- Query performance monitoring
- BigQuery slot usage analysis
- Cost efficiency reporting
- Historical query tracking via INFORMATION_SCHEMA
📄 analytics/cost_effeciency.sql
📄 analytics/performance_queries.sql
Used to:
- Reduce query runtimes
- Lower compute spend
- Support FinOps initiatives
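The repo does this analysis in SQL against `INFORMATION_SCHEMA.JOBS`; the sketch below shows the core arithmetic in plain Python. The $/TiB rate and job records are illustrative assumptions (check current BigQuery on-demand pricing before using a figure like this).

```python
# Pure-Python sketch of the cost-efficiency analysis done in SQL in the repo.
PRICE_PER_TIB = 5.0  # assumed on-demand rate, for illustration only

def query_cost_usd(total_bytes_billed: int) -> float:
    """Approximate on-demand cost of a query from its billed bytes."""
    return total_bytes_billed / 2**40 * PRICE_PER_TIB

def top_spenders(jobs, n=3):
    """Rank job records (dicts with 'query' and 'total_bytes_billed') by cost."""
    return sorted(jobs, key=lambda j: j["total_bytes_billed"], reverse=True)[:n]

jobs = [
    {"query": "SELECT * FROM facts", "total_bytes_billed": 3 * 2**40},
    {"query": "SELECT doc_id FROM facts", "total_bytes_billed": 2**30},
]
worst = top_spenders(jobs, n=1)[0]
print(round(query_cost_usd(worst["total_bytes_billed"]), 2))  # 15.0
```

The contrast between the two sample jobs is the point: a `SELECT *` scans far more bytes than a column-pruned query, which is exactly the kind of finding these analytics queries surface.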
Dashboard Features:
- KPI cards (Revenue, Collections, PTPs, Cost Savings)
- Trend analysis over time
- Cost optimization metrics
- Pipeline health indicators
Dashboards are designed for senior leadership consumption.
- Schema validation
- Basic data quality checks
- CI-integrated testing
📄 tests/test_data_quality.py
Example:
- No negative financial values
- Required columns enforced
- Prevents bad data from reaching analytics layers
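A sketch of the kind of check such a test file might contain. The required column names here are assumptions, not the repo's actual schema.

```python
# Hypothetical data-quality check: required columns and no negative amounts.
REQUIRED_COLUMNS = {"doc_id", "amount", "status", "source_system"}

def check_row(row: dict) -> list[str]:
    """Return a list of violations; an empty list means the row passes."""
    errors = []
    missing = REQUIRED_COLUMNS - row.keys()
    if missing:
        errors.append(f"missing columns: {sorted(missing)}")
    if row.get("amount", 0) < 0:
        errors.append("negative financial value")
    return errors

good = {"doc_id": "D1", "amount": 10.0, "status": "OPEN", "source_system": "sap"}
bad = {"doc_id": "D2", "amount": -5.0, "status": "OPEN"}
print(check_row(good), check_row(bad))
```

Wired into CI, a check like this fails the build before bad rows ever reach the analytics layer.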
- Automated via GitHub Actions
- Runs on every pull request
- Validates Python and SQL assets
- Enforces engineering discipline for data pipelines
📄 .github/workflows/ci.yml
- Data lineage and source traceability
- Status standardization logic
- Reproducible transformations
- Analytics-ready data modeling
Governance principles:
- Auditability
- Reusability
- Scalability
- Security-by-design
This solution is cloud-agnostic by design.
| Layer | GCP | AWS | Azure |
|---|---|---|---|
| Storage | GCS | S3 | ADLS |
| ETL | Dataflow | Glue | Data Factory |
| Orchestration | Composer | MWAA | ADF |
| Warehouse | BigQuery | Redshift | Synapse |
| BI | Looker | QuickSight | Power BI |
Core design patterns remain consistent across platforms.
- Improved query performance by up to 40%
- Reduced cloud compute costs
- Delivered executive-grade dashboards
- Implemented production-style ETL, testing, and CI/CD
- Mentored junior engineers on data platform best practices
Andiswa Matai
Senior Data Engineer | Analytics & Cloud Platforms
🔗 Return to main portfolio: Andiswa-Matai_Portfolio

