fix(ingest/powerbi): emit lineage for paginated reports with embedded RDL datasources#17582

Merged
treff7es merged 7 commits into master from fix/powerbi-paginated-report-rdl-lineage
May 29, 2026

Conversation

@alfiyas-datahub
Contributor

Summary

PowerBI paginated reports (RDL) that connect directly to a SQL Server / Oracle / PostgreSQL / OleDb backend, rather than to a shared PowerBI dataset, were silently emitting zero upstream lineage. The connector treats every report the same and resolves lineage only via report.dataset_id, which the workspace scan API never returns for embedded RDL datasources. Result: 0/N paginated reports get datasetEdges, and nothing in the ingestion report warns about it.

This PR adds a fallback path that calls GET /groups/{groupId}/reports/{reportId}/datasources for PaginatedReports without a dataset_id, and turns the returned datasourceType + connectionDetails.{server,database} into upstream datasetEdges using the existing platform-mapping machinery (SupportedDataPlatform, server_to_platform_instance).
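The fallback condition and the endpoint it targets can be sketched as follows. This is a minimal illustration, not the connector's code: the `Report` stand-in, `needs_datasource_fallback`, and `datasources_url` are hypothetical names, the `"PaginatedReport"` type string is an assumption, and the base URL is the public Power BI REST root rather than whatever the connector configures internally.

```python
from dataclasses import dataclass
from typing import Optional

# Public Power BI REST root (assumed here; the connector may build URLs differently).
BASE_URL = "https://api.powerbi.com/v1.0/myorg"


@dataclass
class Report:
    """Hypothetical stand-in for the connector's report model."""

    id: str
    report_type: str  # assumed values: "PaginatedReport" or "PowerBIReport"
    dataset_id: Optional[str] = None


def needs_datasource_fallback(report: Report) -> bool:
    # Only paginated reports with no shared dataset take the fallback path,
    # so every other report issues no additional request.
    return report.report_type == "PaginatedReport" and not report.dataset_id


def datasources_url(group_id: str, report_id: str) -> str:
    # GET /groups/{groupId}/reports/{reportId}/datasources
    return f"{BASE_URL}/groups/{group_id}/reports/{report_id}/datasources"
```

A report that already carries a `dataset_id` returns `False` from the check and keeps using the existing dataset-based lineage path.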

What changes

  • New ReportDatasource dataclass + Report.datasources field for the new payload (data_classes.py).
  • New Constant.REPORT_DATASOURCES endpoint key and get_report_datasources() on DataResolverBase, with a regular-API implementation and an admin-API no-op (config.py, data_resolver.py). PowerBI doesn't expose an admin variant of this endpoint, so admin_apis_only deployments get an empty list.
  • New branch in PowerBiAPI.get_reports() that fetches /datasources only for PaginatedReports without a dataset_id, so existing code paths incur no extra HTTP traffic (powerbi_api.py).
  • New paginated_report_datasource_urns() helper that maps each ReportDatasource to a DataHub URN via SupportedDataPlatform + server_to_platform_instance, and folds the URNs into the existing dataset_edges set in report_to_datahub_work_units() (powerbi.py). No new aspect type, no new MCP shape - the lineage rides the same DashboardInfo.datasetEdges path regular reports use today.
  • New structured Missing Lineage For Paginated Report info event so silent skips surface in the ingestion report. The existing Missing Lineage For Report event only fires when dataset_id is present but unresolvable, which is why this bug class has been invisible.
  • New integration test test_powerbi_paginated_report_rdl_lineage + dedicated mock fixture covering exactly the bug-case shape (no datasetId, with /datasources returning SQL Server connection details).
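The payload-to-URN step described in the list above might look roughly like this. It is a sketch under stated assumptions, not the PR's implementation: `PLATFORM_BY_TYPE` is a hypothetical stand-in for SupportedDataPlatform, `server_to_platform_instance` is simplified to a plain server-to-instance-name dict, the `value` wrapper around the datasource list is assumed from the usual Power BI response shape, and naming the dataset after the database alone is an illustrative choice (the real helper's naming is not shown in this description).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class ReportDatasource:
    """Simplified version of the dataclass named in the PR description."""

    datasource_type: str
    server: Optional[str] = None
    database: Optional[str] = None


# Hypothetical mapping standing in for SupportedDataPlatform.
PLATFORM_BY_TYPE: Dict[str, str] = {
    "Sql": "mssql",
    "Oracle": "oracle",
    "PostgreSql": "postgres",
}


def parse_datasources(payload: dict) -> List[ReportDatasource]:
    # Assumes the response wraps its entries in a "value" list.
    result = []
    for item in payload.get("value", []):
        details = item.get("connectionDetails") or {}
        result.append(
            ReportDatasource(
                datasource_type=item.get("datasourceType", ""),
                server=details.get("server"),
                database=details.get("database"),
            )
        )
    return result


def datasource_urns(
    datasources: List[ReportDatasource],
    server_to_platform_instance: Dict[str, str],
    env: str = "PROD",
) -> Set[str]:
    urns: Set[str] = set()
    for ds in datasources:
        platform = PLATFORM_BY_TYPE.get(ds.datasource_type)
        if platform is None or not ds.database:
            continue  # unmapped type or missing database: no URN for this entry
        instance = server_to_platform_instance.get(ds.server or "")
        # DataHub prefixes the dataset name with the platform instance, if any.
        name = f"{instance}.{ds.database}" if instance else ds.database
        urns.add(f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})")
    return urns
```

Returning a set mirrors the description's point that the URNs are folded into the existing dataset_edges set, so duplicate datasources on one report collapse to a single edge.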

Observability

After this change, every paginated report that doesn't resolve to upstream lineage produces a structured info event:

  • Missing Lineage For Paginated Report - report has no dataset_id and /datasources returned nothing (admin-only mode, permission gap, or genuinely empty)
  • Unmapped Paginated Report Datasource - /datasources returned a datasourceType that has no entry in SupportedDataPlatform (e.g., a Microsoft-only type we haven't mapped yet)

Both messages name the report and (where applicable) the server so operators can diagnose missing lineage without re-reading the connector source.
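The decision between the two events can be illustrated with a small sketch. Everything here is hypothetical apart from the two event titles quoted above: `KNOWN_TYPES` stands in for the SupportedDataPlatform lookup, and the function returns plain tuples instead of calling the connector's real reporting API, which is not shown in this description.

```python
from typing import List, Optional, Tuple

# Hypothetical stand-in for the datasource types SupportedDataPlatform can map.
KNOWN_TYPES = {"Sql", "Oracle", "PostgreSql"}


def lineage_events(
    report_name: str,
    datasources: List[Tuple[str, Optional[str]]],
) -> List[Tuple[str, str]]:
    """Return (title, context) pairs for a paginated report with no dataset_id.

    `datasources` holds (datasourceType, server) pairs from /datasources.
    """
    if not datasources:
        # Admin-only mode, permission gap, or a genuinely empty response.
        return [("Missing Lineage For Paginated Report", f"report={report_name}")]

    events = []
    for ds_type, server in datasources:
        if ds_type not in KNOWN_TYPES:
            events.append(
                (
                    "Unmapped Paginated Report Datasource",
                    f"report={report_name}, type={ds_type}, server={server}",
                )
            )
    return events
```

A report whose datasources all map cleanly yields an empty list, which matches the intent that only unresolved lineage is reported.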

@codecov

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 46.93878% with 26 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...on/src/datahub/ingestion/source/powerbi/powerbi.py | 7.69% | 12 Missing ⚠️ |
| ...ion/source/powerbi/rest_api_wrapper/powerbi_api.py | 11.11% | 8 Missing ⚠️ |
| ...n/source/powerbi/rest_api_wrapper/data_resolver.py | 25.00% | 6 Missing ⚠️ |

@treff7es
Contributor

The test calls:

read_mock_data(
    pytestconfig.rootpath
    / "tests/integration/powerbi/mock_data/paginated_report_rdl_datasources.json"
)

read_mock_data does a plain open() with no fallback, and the file paginated_report_rdl_datasources.json is not in the PR. Every CI run will raise FileNotFoundError before a single assertion executes. This is an unconditional blocker: the test cannot pass.

Fix: Add the mock JSON file to tests/integration/powerbi/mock_data/ in the PR.
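The failure mode is easy to reproduce in isolation. The `read_mock_data` below is a stand-in written for this illustration; it assumes the real helper opens the path and parses JSON, which the comment above implies but does not show, and the fixture content is a placeholder rather than the actual mock payload.

```python
import json
import tempfile
from pathlib import Path


def read_mock_data(path: Path) -> dict:
    # Stand-in for the test helper: a plain open() with no fallback,
    # so a missing fixture raises FileNotFoundError immediately.
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def fixture_loads(path: Path) -> bool:
    """Report whether the fixture can be read, instead of raising."""
    try:
        read_mock_data(path)
    except FileNotFoundError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    fixture = Path(tmp) / "paginated_report_rdl_datasources.json"
    missing_before = not fixture_loads(fixture)  # file absent: load fails
    fixture.write_text('{"value": []}', encoding="utf-8")  # placeholder content
    present_after = fixture_loads(fixture)  # file committed: load succeeds
```

Once the fixture file exists at the expected path, the same call succeeds without any change to the helper.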

@codecov

codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 95.94595% with 3 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...on/src/datahub/ingestion/source/powerbi/powerbi.py | 88.88% | 2 Missing ⚠️ |
| ...ion/source/powerbi/rest_api_wrapper/powerbi_api.py | 95.23% | 1 Missing ⚠️ |

@datahub-connector-tests

Connector Tests Results

All connector tests passed for commit 3260ca8

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

@treff7es treff7es merged commit 2ec3992 into master May 29, 2026
53 checks passed
@treff7es treff7es deleted the fix/powerbi-paginated-report-rdl-lineage branch May 29, 2026 14:32