Migrate Configs and Rules to Pydantic #1259

Open
fedeflowers wants to merge 8 commits into databrickslabs:main from fedeflowers:feature/pydantic-migration

@fedeflowers

Contributor

Changes

Migrate DQX data models from dataclasses to Pydantic v2 for automatic
validation and simpler YAML serialization/deserialization.

Migrated to pydantic.BaseModel:

  • checks_validator.py: ChecksValidationStatus (errors held in a real
    aliased field so it stays constructor-settable)
  • config.py: 7 storage-config classes; __post_init__ →
    @model_validator(mode="after") with type(self) is ... guards for the
    multiple-inheritance composite; cached_property → property
  • rule.py: DQRule, DQRowRule, DQDatasetRule, DQForEachColRule →
    frozen Pydantic models (ConfigDict(frozen=True, arbitrary_types_allowed=True)
    for Spark Callable/Column fields); class constants marked ClassVar
  • pyproject.toml: add pydantic>=2.8.2

Intentionally left as dataclasses: WorkspaceConfig, RunConfig, and the
nested config tree, because databricks-labs-blueprint's Installation.load()
relies on dataclasses.is_dataclass() / __dataclass_fields__, which Pydantic
models don't satisfy.
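
A minimal sketch of the incompatibility described above, using hypothetical stand-in classes rather than the real WorkspaceConfig. It only exercises the two checks named in the description; how Installation.load() uses them internally is not reproduced here.

```python
import dataclasses

from pydantic import BaseModel


@dataclasses.dataclass
class DataclassConfig:  # hypothetical stand-in for a dataclass-based config
    name: str = "default"


class PydanticConfig(BaseModel):  # hypothetical stand-in for a migrated model
    name: str = "default"


# A loader that dispatches on these two checks accepts only the dataclass.
dataclass_ok = dataclasses.is_dataclass(DataclassConfig)
pydantic_ok = dataclasses.is_dataclass(PydanticConfig)
has_fields_attr = hasattr(PydanticConfig, "__dataclass_fields__")

print(dataclass_ok, pydantic_ok, has_fields_attr)  # True False False
```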

Linked issues

Resolves #467

Tests

  • manually tested
  • added unit tests
  • added integration tests
  • added end-to-end tests
  • added performance tests

fedeflowers and others added 3 commits June 19, 2026 21:07
Migrate storage configs, ChecksValidationStatus and the DQRule family to
Pydantic v2 BaseModel. WorkspaceConfig/RunConfig and their nested config tree
stay as dataclasses because databricks-labs-blueprint's Installation load/save
dispatches on dataclasses.is_dataclass() and __dataclass_fields__.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Revert location/columns type widening that broke mypy (str|None cascaded
  into checks_storage.py; widened columns broke manager.py and get_rules).
  Empty/None location now rejected in @model_validator(mode='before') so the
  DQX-specific InvalidConfigError/InvalidParameterError is still raised before
  Pydantic coercion; None elements in columns rejected the same way.
- Declare _expected_rule_type/_alternative_rules as ClassVar on DQRuleTypeMixin
  so subclass ClassVar overrides no longer trip mypy.
- Use plain mutable field defaults ({}/[]) instead of Field(default_factory=...)
  (Pydantic v2 deep-copies per instance) to clear pylint FieldInfo no-member.
- Use self.__class__ is not X instead of type(self) for exact-type guards.
- Add pydantic to uv.lock via make lock-dependencies (was missing from the
  databricks-labs-dqx dependency block).
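
The plain-default bullet above relies on Pydantic v2 copying a mutable default for every instance. A small sketch with a hypothetical model (not a DQX class) shows that two instances do not share the default list or dict:

```python
from pydantic import BaseModel


class DefaultsSketch(BaseModel):  # hypothetical model, for illustration only
    errors: list[str] = []
    options: dict[str, str] = {}


first = DefaultsSketch()
second = DefaultsSketch()

# Mutating one instance's defaults must not leak into the other.
first.errors.append("boom")
first.options["key"] = "value"

print(second.errors, second.options)  # [] {}
```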

make lint (black, ruff, mypy, pylint 10/10) and 1134 unit tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@fedeflowers fedeflowers requested a review from a team as a code owner June 20, 2026 23:09
@fedeflowers fedeflowers requested review from nehamilak-db and removed request for a team June 20, 2026 23:09
@mwojtyczka mwojtyczka changed the title Feature/pydantic migration Migrate Configs and Rules to Pydantic Jun 23, 2026
@mwojtyczka mwojtyczka added the under-review This PR is currently being reviewed by one of DQX maintainers. label Jun 23, 2026
@mwojtyczka mwojtyczka self-requested a review June 23, 2026 18:07
Comment thread src/databricks/labs/dqx/rule.py
Comment thread src/databricks/labs/dqx/checks_validator.py Outdated
Comment thread src/databricks/labs/dqx/config.py

@mwojtyczka mwojtyczka left a comment

Contributor

Generally looking good, left some comments.

We will need follow-up PRs for Pydantic to cover serialization and validation; that would bring real value. It's fine to have just the base-class swap in this PR. That's a reasonable incremental step; routing validation/serialization through Pydantic is a separate, higher-risk refactor that shouldn't bloat this one.

…ace, aliased ChecksValidationStatus.errors, model_copy in lakebase test)
…ttable field (extra=forbid) and add validating BaseChecksStorageConfig.replace(); use it in lakebase test instead of validation-skipping model_copy
@fedeflowers

Contributor Author

Recap of changes:

1. engine.py: dataclasses.replace in _filter_for_original_columns_preselection

Added a small DQRule.replace() helper and call check.replace(check_func_kwargs=rule_kwargs) instead of model_copy(update=...). The reason for not using model_copy directly: it (a) does not re-run validators — DQRule's mode="after" validator validates attributes and initialises the name — and (b) shallow-copies the instance dict, so it carries over already-cached functools.cached_property values (rule_fingerprint, columns_as_string_expr). replace() rebuilds through the constructor, so validation re-runs and derived state is recomputed from the updated fields. Quick demonstration of the difference:

r = DQRowRule(name="x", check_func=is_not_null, column="id")
_ = r.rule_fingerprint  # populate the cached_property
r.model_copy(update={"column": "y"}).rule_fingerprint == r.rule_fingerprint  # True  (stale)
r.replace(column="y").rule_fingerprint            == r.rule_fingerprint       # False (recomputed)
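
The same effect can be reproduced without DQX or Spark. The sketch below uses a hypothetical RuleSketch model; its replace() is one plausible way to rebuild through the constructor and is not necessarily how DQRule.replace() is written.

```python
from functools import cached_property

from pydantic import BaseModel, ConfigDict


class RuleSketch(BaseModel):  # hypothetical stand-in for a frozen rule model
    model_config = ConfigDict(frozen=True)
    name: str
    column: str

    @cached_property
    def fingerprint(self) -> str:
        # Derived state, stored in the instance __dict__ on first access.
        return f"{self.name}:{self.column}"

    def replace(self, **changes):
        # Rebuild through the constructor: validators re-run, caches start empty.
        current = {field: getattr(self, field) for field in type(self).model_fields}
        return type(self)(**{**current, **changes})


rule = RuleSketch(name="x", column="id")
_ = rule.fingerprint  # populate the cache

copied = rule.model_copy(update={"column": "y"})  # copies __dict__, cache included
rebuilt = rule.replace(column="y")

print(copied.column, copied.fingerprint)    # y x:id  (stale)
print(rebuilt.column, rebuilt.fingerprint)  # y x:y   (recomputed)
```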

2. checks_validator.py: _errors not constructor-settable

Good catch — PrivateAttr (and the interim aliased-field variant) were both more trouble than they were worth. Made errors a plain public field instead: errors: list[str] = [], which is constructor-settable as ChecksValidationStatus(errors=[...]); has_errors stays a @property. Updated the one call site (test_app_backend.py:1873) from _errors=[...] to errors=[...], so the test once again actually exercises the error path. Also added model_config = ConfigDict(extra="forbid") so a stray/unknown kwarg now raises ValidationError loudly instead of being silently dropped. This also removed the Field(...) that was tripping pylint's no-member on self.errors.append, so make lint is green again.
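
For readers unfamiliar with the extra setting, here is a rough sketch of the difference using two hypothetical models. The field and property names mirror the description above, but the classes are illustrative and are not the actual ChecksValidationStatus.

```python
from pydantic import BaseModel, ConfigDict, ValidationError


class LenientStatus(BaseModel):  # Pydantic's default is extra="ignore"
    errors: list[str] = []


class StrictStatus(BaseModel):
    model_config = ConfigDict(extra="forbid")
    errors: list[str] = []

    @property
    def has_errors(self) -> bool:
        return bool(self.errors)


# A misspelled kwarg is silently dropped by the lenient model...
lenient = LenientStatus(errs=["typo in kwarg"])

# ...but rejected loudly once extra="forbid" is set.
try:
    StrictStatus(errs=["typo in kwarg"])
    strict_raised = False
except ValidationError:
    strict_raised = True

strict = StrictStatus(errors=["bad check"])
print(lenient.errors, strict_raised, strict.has_errors)  # [] True True
```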

3. lakebase test — dataclasses.replace on LakebaseChecksStorageConfig

Rather than model_copy(update={"mode": ...}), I added a validating BaseChecksStorageConfig.replace() helper (mirroring DQRule.replace()) and switched the test to config.replace(mode=...), and dropped the import dataclasses. The reason for not using model_copy: it skips all validators, so an invalid override (e.g. a bad mode or a non-3-part location) would be accepted silently and only blow up later inside the save path. replace() rebuilds through the constructor so the storage-config validators re-run — verified:

c = LakebaseChecksStorageConfig(location="a.b.c", instance_name="inst")
c.model_copy(update={"mode": "BOGUS"}).mode  # 'BOGUS'           (accepted silently)
c.replace(mode="BOGUS")                       # raises InvalidConfigError
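
A self-contained approximation of that check, assuming a hypothetical StorageConfigSketch model and a hypothetical InvalidConfigSketchError in place of the DQX classes. The error deliberately does not derive from ValueError, so Pydantic lets it propagate unchanged instead of wrapping it in ValidationError.

```python
from pydantic import BaseModel, model_validator


class InvalidConfigSketchError(Exception):
    """Hypothetical stand-in for a DQX-specific error type."""


class StorageConfigSketch(BaseModel):  # hypothetical storage config
    location: str
    mode: str = "append"

    @model_validator(mode="after")
    def check_mode(self):
        if self.mode not in {"append", "overwrite"}:
            raise InvalidConfigSketchError(f"Invalid mode: {self.mode}")
        return self

    def replace(self, **changes):
        # Going through the constructor re-runs check_mode on the new values.
        current = {field: getattr(self, field) for field in type(self).model_fields}
        return type(self)(**{**current, **changes})


config = StorageConfigSketch(location="a.b.c")

copied = config.model_copy(update={"mode": "BOGUS"})  # no validation happens

try:
    config.replace(mode="BOGUS")
    replace_raised = False
except InvalidConfigSketchError:
    replace_raised = True

print(copied.mode, replace_raised)  # BOGUS True
```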

Also did a completeness sweep / added coverage:

  • Grepped the repo for other dataclasses.replace/asdict on migrated models — the remaining occurrences all target classes still defined as @dataclass (WorkspaceConfig, workflow Task, DQProfile, anomaly ExplanationContext), so they're unaffected.
  • Added unit tests for DQRule.replace() (override + field preservation, cached-fingerprint recomputation, validator re-run) and BaseChecksStorageConfig.replace() (override + field preservation, validator re-run).
  • make lint green (10.00/10); affected unit tests pass (validator + storage configs + rule + app dry-run). Ran the affected integration paths: has_valid_schema (the check.replace preselection path) — 21 passed, 1 skipped; rule-fingerprint apply — 4 passed.

@fedeflowers fedeflowers requested a review from mwojtyczka June 23, 2026 22:30
@grusin-db

grusin-db commented Jun 24, 2026

Collaborator

What is the planned approach for serialisation? That's the part where Pydantic support actually matters the most.

In DQX we have a few serializers: a YAML mode, where YAMLs need to be serialized to normal validators, a Delta serializer for storage, and probably a few more that I am not aware of or that slipped my mind. These serializers need to be considered before making changes to the models, so that we don't deadlock ourselves with potentially bad decisions.

@grusin-db

grusin-db commented Jun 24, 2026

Collaborator

Another topic: since DQRules are now Pydantic models, have you tried serializing them into a JSON schema? There is .model_json_schema(); this should be included in testing to ensure it passes.

It might, and probably will, trip on some internal non-serializable types, so this needs to be verified to make sure it works end to end.

@grusin-db

Collaborator

In case of bad user input, all the errors will now be totally different: if a Pydantic model raises, it will raise its own ValidationError, which in turn will trip all the validation-error tests we have, hence that needs to be covered.

I don't see any test changes regarding that. E.g. putting "blabla" into an "int" field would for sure raise a Pydantic validation error about "blabla" not being convertible to an int.

@mwojtyczka

Contributor

What is the planned approach for serialisation? That's the part where Pydantic support actually matters the most.

In DQX we have a few serializers: a YAML mode, where YAMLs need to be serialized to normal validators, a Delta serializer for storage, and probably a few more that I am not aware of or that slipped my mind. These serializers need to be considered before making changes to the models, so that we don't deadlock ourselves with potentially bad decisions.

We will do this in follow-up PRs: serialization and validation.

@mwojtyczka

mwojtyczka commented Jun 24, 2026

Contributor

Another topic: since DQRules are now Pydantic models, have you tried serializing them into a JSON schema? There is .model_json_schema(); this should be included in testing to ensure it passes.

It might, and probably will, trip on some internal non-serializable types, so this needs to be verified to make sure it works end to end.

You are right:
DQRule.model_json_schema() → PydanticInvalidForJsonSchema: Cannot generate a JsonSchema for core_schema.CallableSchema
DQRowRule.model_json_schema() → same
FileChecksStorageConfig → OK
ChecksValidationStatus → OK

The Callable/Column fields need a custom json_schema / WithJsonSchema annotation — and a test
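
One possible shape for that annotation, sketched with hypothetical models. Describing the callable as a string (for example a function name) is an assumption made here for illustration; the PR does not say what schema the DQX maintainers want, and Spark Column fields are not covered by this sketch.

```python
from typing import Annotated, Any, Callable

from pydantic import BaseModel
from pydantic.errors import PydanticInvalidForJsonSchema
from pydantic.json_schema import WithJsonSchema


class PlainRule(BaseModel):  # hypothetical: bare Callable field
    check_func: Callable[..., Any]


# Assumption: schema consumers see the callable as a string.
CheckFunc = Annotated[Callable[..., Any], WithJsonSchema({"type": "string"})]


class AnnotatedRule(BaseModel):  # hypothetical: Callable with a schema override
    check_func: CheckFunc


try:
    PlainRule.model_json_schema()
    plain_failed = False
except PydanticInvalidForJsonSchema:
    plain_failed = True

schema = AnnotatedRule.model_json_schema()
print(plain_failed, schema["properties"]["check_func"]["type"])
```

A regression test along these lines could call model_json_schema() on every migrated model so that a newly added non-serializable field fails fast.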

@mwojtyczka

Contributor

In case of bad user input, all the errors will now be totally different: if a Pydantic model raises, it will raise its own ValidationError, which in turn will trip all the validation-error tests we have, hence that needs to be covered.

I don't see any test changes regarding that. E.g. putting "blabla" into an "int" field would for sure raise a Pydantic validation error about "blabla" not being convertible to an int.

Yes, FileChecksStorageConfig raises pydantic_core.ValidationError. So programmatic construction with wrong-typed fields now surfaces Pydantic's error instead of DQX's own. The tests pass because the dict-level ChecksValidator still catches metadata errors before construction — but the model-construction path has no test coverage for the new error shape. I think DQX should catch/wrap ValidationError into its own error types for a consistent API rather than leak pydantic.ValidationError.
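
A rough sketch of what such wrapping could look like. Everything here is hypothetical (the error class, the model, and the build_config helper); it only illustrates catching pydantic.ValidationError at the construction boundary and re-raising a project-specific error, not an agreed DQX design.

```python
from pydantic import BaseModel, ValidationError


class InvalidConfigSketchError(Exception):
    """Hypothetical DQX-style error; the real error classes are not shown here."""


class FileConfigSketch(BaseModel):  # hypothetical stand-in for a storage config
    location: str
    retries: int = 0


def build_config(**kwargs) -> FileConfigSketch:
    """Construct the model, translating Pydantic's error into the project's own."""
    try:
        return FileConfigSketch(**kwargs)
    except ValidationError as exc:
        # Flatten Pydantic's structured errors into "field: message" pairs.
        details = "; ".join(
            f"{'.'.join(str(part) for part in err['loc'])}: {err['msg']}"
            for err in exc.errors()
        )
        raise InvalidConfigSketchError(details) from exc


try:
    build_config(location="checks.yml", retries="blabla")
    message = ""
except InvalidConfigSketchError as exc:
    message = str(exc)

print(message)
```

Chaining with `from exc` keeps the original ValidationError available on `__cause__` for debugging while callers only need to handle the project-specific type.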

@mwojtyczka mwojtyczka left a comment

Contributor

@fedeflowers I left comments as follow-up requests from @grusin-db; at least 2 of them are valid. Using Pydantic for serialization can be done in a follow-up PR. I kicked off the test suite and it seems some tests are failing. Please check.

Labels

under-review This PR is currently being reviewed by one of DQX maintainers.

Development

Successfully merging this pull request may close these issues.

[FEATURE]: Migrate data models to Pydantic

3 participants