fix(ingest/powerbi): emit lineage for paginated reports with embedded RDL datasources#17582
Conversation
c71a386 to
74e0170
Compare
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
|
The test calls: read_mock_data does a plain open() with no fallback. The file paginated_report_rdl_datasources.json is not in the PR. Every CI run will raise FileNotFoundError before a single assertion executes. This is an unconditional blocker — the test cannot pass. Fix: Add the mock JSON file to tests/integration/powerbi/mock_data/ in the PR. |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
Connector Tests ResultsAll connector tests passed for commit To skip connector tests, add the Autogenerated by the connector-tests CI pipeline. |
Summary
PowerBI paginated reports (RDL) that connect directly to a SQL Server / Oracle / PostgreSQL / OleDb backend instead of a shared PowerBI dataset were silently emitting zero upstream lineage. The connector treats every report the same and only resolves lineage via
report.dataset_id, which the workspace scan API never returns for embedded RDL datasources. Result: 0/N paginated reports get adatasetEdgesaspect, with no warning in the ingestion report.This PR adds a fallback path that calls
GET /groups/{groupId}/reports/{reportId}/datasourcesforPaginatedReports without adataset_id, and turns the returneddatasourceType+connectionDetails.{server,database}into upstreamdatasetEdgesusing the existing platform-mapping machinery (SupportedDataPlatform,server_to_platform_instance).What changes
ReportDatasourcedataclass +Report.datasourcesfield for the new payload (data_classes.py).Constant.REPORT_DATASOURCESendpoint key andget_report_datasources()onDataResolverBase, with a regular-API implementation and an admin-API no-op (config.py,data_resolver.py). PowerBI doesn't expose an admin variant of this endpoint, soadmin_apis_onlydeployments get an empty list.PowerBiAPI.get_reports()that only fetches/datasourcesforPaginatedReports without adataset_id, so zero extra HTTP traffic for any existing code path (powerbi_api.py).paginated_report_datasource_urns()helper that maps eachReportDatasourceto a DataHub URN viaSupportedDataPlatform+server_to_platform_instance, and folds the URNs into the existingdataset_edgesset inreport_to_datahub_work_units()(powerbi.py). No new aspect type, no new MCP shape - the lineage rides the sameDashboardInfo.datasetEdgespath regular reports use today.Missing Lineage For Paginated Reportinfo event so silent skips surface in the ingestion report. The existingMissing Lineage For Reportevent only fires whendataset_idis present but unresolvable, which is why this bug class has been invisible.test_powerbi_paginated_report_rdl_lineage+ dedicated mock fixture covering exactly the bug-case shape (nodatasetId, with/datasourcesreturning SQL Server connection details).Observability
After this change, every paginated report that doesn't resolve to upstream lineage produces a structured info event:
Missing Lineage For Paginated Report- report has nodataset_idand/datasourcesreturned nothing (admin-only mode, permission gap, or genuinely empty)Unmapped Paginated Report Datasource-/datasourcesreturned adatasourceTypethat has no entry inSupportedDataPlatform(e.g., a Microsoft-only type we haven't mapped yet)Both messages name the report and (where applicable) the server so operators can diagnose missing lineage without re-reading the connector source.