Support for Historical CDC Backfill (Append Flow with once=True) for Non-Delta Sources (e.g., Parquet) in DLT-META #271

@shishupalgeek

Hi Team,

I would like to request support for loading historical CDC (Change Data Capture) data as part of a typical ingestion pattern from SAP DI sources.


Use Case

My use case involves:

  • Initial load (historical backfill) of CDC data
  • Followed by incremental CDC ingestion

Current Approach

I am currently implementing this using:

  • Append flow for CDC ingestion → working as expected
  • Append flow with once=True for initial backfill, based on official Databricks guidance:

https://learn.microsoft.com/en-us/azure/databricks/ldp/flows-backfill

This approach works successfully when using Spark Declarative Pipelines.
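A minimal sketch of this pattern follows, under several assumptions. `dp` stands for the pipeline module (`from pyspark import pipelines as dp` in a Databricks pipeline), and `append_flow(..., once=True)` is used as described in the guidance linked above. The function name `define_cdc_flows`, the flow names, and both paths are hypothetical, and Auto Loader (`cloudFiles`) is only one possible incremental reader.

```python
def define_cdc_flows(dp, spark, target, incremental_path, backfill_path):
    """Register a streaming CDC flow and a one-time backfill flow on one target.

    `dp` is assumed to be the pipelines module; `spark` is the session that the
    pipeline runtime provides. Both are passed in so the wiring is explicit.
    """
    # Both flows append into the same streaming table.
    dp.create_streaming_table(target)

    @dp.append_flow(target=target, name=f"{target}_incremental")
    def incremental():
        # Streaming read: picks up new CDC files as they arrive (Auto Loader).
        return (
            spark.readStream.format("cloudFiles")
            .option("cloudFiles.format", "parquet")
            .load(incremental_path)
        )

    @dp.append_flow(target=target, name=f"{target}_backfill", once=True)
    def backfill():
        # Batch read of the historical Parquet files; once=True marks the
        # flow as a one-time backfill rather than a continuous stream.
        return spark.read.format("parquet").load(backfill_path)
```

Inside a pipeline source file this would be called once per target, for example `define_cdc_flows(dp, spark, "sap_cdc", "<incremental path>", "<historical path>")`, where the table name and paths are placeholders.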

Challenge in DLT-META

In DLT-META, I am facing limitations because:

  • It currently does not support batch reads from Parquet or other file formats
  • Supported batch sources appear to be limited to:
    • Delta
    • Snapshot-based ingestion

Due to this limitation:

  • I am unable to implement the initial historical load using append flow (once=True)
  • This blocks a standard CDC ingestion pattern (Initial Load + Incremental CDC)

Expected Behavior / Feature Request

It would be very helpful if DLT-META could support:

  • Batch ingestion from Parquet (and potentially other file formats)
  • Compatibility with append flows using once=True for backfill scenarios
  • A unified pattern to support:
    • Initial historical load
    • Continuous CDC ingestion
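To make the request more concrete, here is a rough sketch of how a metadata entry could be expanded into flow definitions. Everything in it is hypothetical: the keys `target_table`, `source_format`, `source_path`, and `backfill`, the `FlowPlan` type, and the `plan_flows` function are illustrative only and do not reflect DLT-META's actual onboarding schema or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowPlan:
    """Hypothetical description of one append flow to be generated."""
    name: str
    target: str
    reader: str          # "stream" for incremental CDC, "batch" for backfill
    source_format: str
    path: str
    once: bool


def plan_flows(entry: dict) -> list[FlowPlan]:
    """Expand one (hypothetical) metadata entry into its append flows."""
    target = entry["target_table"]
    # The continuous CDC flow is always present.
    plans = [
        FlowPlan(
            name=f"{target}_incremental",
            target=target,
            reader="stream",
            source_format=entry["source_format"],
            path=entry["source_path"],
            once=False,
        )
    ]
    # An optional "backfill" section adds a one-time batch flow.
    backfill = entry.get("backfill")
    if backfill:
        plans.append(
            FlowPlan(
                name=f"{target}_backfill",
                target=target,
                reader="batch",
                # Fall back to the incremental format when none is given.
                source_format=backfill.get("source_format", entry["source_format"]),
                path=backfill["source_path"],
                once=True,
            )
        )
    return plans
```

With this shape, an entry without a `backfill` section yields only the streaming flow, so existing configurations would keep their current behavior.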

Questions

  1. Is this functionality currently supported in any way within DLT-META that I may have missed?
  2. Are there any recommended workarounds for implementing this pattern other than using apply changes?
  3. Is there any plan to include this capability in the DLT-META roadmap?

Contribution

If this feature is not yet supported or on the roadmap, I would be happy to:

  • Contribute to the implementation
  • Collaborate on design or testing

Impact

This feature would enable:

  • Standardized CDC ingestion patterns
  • Better support for CDC sources such as SAP DI, Kafka, cloud sources, Event Hubs, and Kinesis
  • Greater flexibility in handling historical data loads

Thanks in advance for your guidance and support!
