Migrate Configs and Rules to Pydantic #1259
Migrate storage configs, ChecksValidationStatus and the DQRule family to Pydantic v2 BaseModel. WorkspaceConfig/RunConfig and their nested config tree stay as dataclasses because databricks-labs-blueprint's Installation load/save dispatches on dataclasses.is_dataclass() and __dataclass_fields__. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Revert location/columns type widening that broke mypy (str|None cascaded
into checks_storage.py; widened columns broke manager.py and get_rules).
Empty/None location now rejected in @model_validator(mode='before') so the
DQX-specific InvalidConfigError/InvalidParameterError is still raised before
Pydantic coercion; None elements in columns rejected the same way.
- Declare _expected_rule_type/_alternative_rules as ClassVar on DQRuleTypeMixin
so subclass ClassVar overrides no longer trip mypy.
- Use plain mutable field defaults ({}/[]) instead of Field(default_factory=...)
(Pydantic v2 deep-copies per instance) to clear pylint FieldInfo no-member.
- Use self.__class__ is not X instead of type(self) for exact-type guards.
- Add pydantic to uv.lock via make lock-dependencies (was missing from the
databricks-labs-dqx dependency block).
make lint (black, ruff, mypy, pylint 10/10) and 1134 unit tests pass.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
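The before-validator and mutable-default points in the commit message above can be illustrated with a minimal sketch. StorageConfigSketch and its fields are hypothetical, and InvalidConfigError here is a stand-in that is assumed not to subclass ValueError; the real DQX classes may differ. Pydantic v2 only wraps ValueError and AssertionError raised inside validators into its own ValidationError, so under that assumption the custom error reaches the caller unchanged.

```python
from typing import Any

from pydantic import BaseModel, model_validator


class InvalidConfigError(Exception):
    """Stand-in for the DQX error type (assumption: not a ValueError subclass)."""


class StorageConfigSketch(BaseModel):
    """Hypothetical model, not one of the real DQX storage configs."""

    location: str
    # Plain mutable default: Pydantic v2 copies it for each instance.
    options: dict[str, str] = {}

    @model_validator(mode="before")
    @classmethod
    def _reject_empty_location(cls, data: Any) -> Any:
        # Runs on the raw input, before field validation or coercion.
        if isinstance(data, dict) and not data.get("location"):
            raise InvalidConfigError("'location' must not be empty or None")
        return data


def try_build(**kwargs: Any) -> str:
    """Return the name of the exception raised while building, or 'ok'."""
    try:
        StorageConfigSketch(**kwargs)
    except Exception as exc:  # broad on purpose: we only report the type
        return type(exc).__name__
    return "ok"


first = StorageConfigSketch(location="checks.yml")
second = StorageConfigSketch(location="other.yml")
first.options["mode"] = "overwrite"  # does not leak into `second`
```

If the real error type does inherit from ValueError, Pydantic would wrap it in a ValidationError instead, which is worth checking in the DQX tests.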
mwojtyczka
left a comment
Generally looking good, left some comments.
We will need follow-up PRs for Pydantic to cover serialization and validation; that is where the real value lies. It's fine to have just the base-class swap in this PR. That's a reasonable incremental step; routing validation/serialization through Pydantic is a separate, higher-risk refactor that shouldn't bloat this one.
…ace, aliased ChecksValidationStatus.errors, model_copy in lakebase test)
…ttable field (extra=forbid) and add validating BaseChecksStorageConfig.replace(); use it in lakebase test instead of validation-skipping model_copy
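As a rough illustration of why a validating replace() differs from model_copy, here is a minimal sketch. ConfigSketch and the body of replace() are assumptions made for this example; the actual BaseChecksStorageConfig.replace() in the PR may be implemented differently.

```python
from typing import Any

from pydantic import BaseModel, ConfigDict, Field, ValidationError


class ConfigSketch(BaseModel):
    """Hypothetical config; the real BaseChecksStorageConfig may differ."""

    model_config = ConfigDict(extra="forbid")

    location: str = Field(min_length=1)

    def replace(self, **changes: Any) -> "ConfigSketch":
        # Rebuild from a dict so field validation and extra="forbid" run again.
        return type(self).model_validate({**self.model_dump(), **changes})


base = ConfigSketch(location="checks.yml")

# model_copy(update=...) does not validate the updated values,
# so an empty location slips through here.
unchecked = base.model_copy(update={"location": ""})


def replace_fails(**changes: Any) -> bool:
    """Return True if replace() rejects the given changes."""
    try:
        base.replace(**changes)
    except ValidationError:
        return True
    return False
```

In this sketch, replace_fails(location="") and replace_fails(unknown=1) both return True, while the model_copy result keeps the invalid empty string.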
Recap of changes: 1. Added a small 2. Good catch — 3. lakebase test — Rather than Also did a completeness sweep / added coverage:
What is the planned approach for serialisation? That's the part where Pydantic support actually matters the most. In DQX we have a few serializers: a YAML mode, where YAMLs need to be serialized to normal validators; a Delta serializer for storage; and probably a few more that I am not aware of or that slipped my mind. These serializers need to be considered before making changes to the models, so that we don't deadlock ourselves with potentially bad decisions.
Another topic: since the DQRules are now Pydantic models, have you tried serializing them into It might, and probably will, trip on some internal non-serializable types, so this needs to be verified to make sure it works e2e.
In case of bad user input, all the errors will now be totally different: if a Pydantic model is throwing, it will throw its own I don't see any test changes regarding that. E.g. putting "blabla" into an "int" field would for sure throw a Pydantic validation error about "blabla" not being parseable as an int.
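A minimal sketch of the behaviour this comment describes, assuming a hypothetical SketchConfig model with a single int field (not a real DQX class): passing a non-numeric string raises pydantic.ValidationError rather than any project-specific error type.

```python
from pydantic import BaseModel, ValidationError


class SketchConfig(BaseModel):
    """Hypothetical model used only to show the error shape."""

    max_retries: int


try:
    SketchConfig(max_retries="blabla")
except ValidationError as exc:
    # Each entry describes one failing field.
    errors = exc.errors()
    first_error_type = errors[0]["type"]
    first_error_loc = errors[0]["loc"]
```

Since ValidationError subclasses ValueError in Pydantic v2, existing `except ValueError` handlers would still catch it, but tests that assert on a specific DQX error class or message would need updating.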
We will do this in follow-up PRs: serialization and validation.
You are right: The Callable/Column fields need a custom json_schema / WithJsonSchema annotation — and a test |
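One possible shape for such an annotation, shown only for a Callable field, is sketched below. RuleSketch, CheckFunc, and is_not_null are hypothetical names, and serializing a function by its __name__ is an assumption for illustration, not what DQX does; a Spark Column field would need its own treatment.

```python
from typing import Annotated, Any, Callable

from pydantic import BaseModel, ConfigDict, PlainSerializer, WithJsonSchema

# Hypothetical alias: serialize the callable as its name and advertise
# it as a string in the generated JSON schema.
CheckFunc = Annotated[
    Callable[..., Any],
    PlainSerializer(lambda func: func.__name__, return_type=str),
    WithJsonSchema({"type": "string", "description": "Name of the check function"}),
]


class RuleSketch(BaseModel):
    """Hypothetical stand-in for a rule model with a callable field."""

    model_config = ConfigDict(frozen=True)

    name: str
    check_func: CheckFunc


def is_not_null(column: str) -> str:
    """Placeholder check function for the example."""
    return f"{column} IS NOT NULL"


rule = RuleSketch(name="id_not_null", check_func=is_not_null)
dumped = rule.model_dump(mode="json")
schema = RuleSketch.model_json_schema()
```

Without the WithJsonSchema annotation, model_json_schema() raises for a bare Callable field, which is the failure mode the follow-up test would need to guard against.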
Yes, FileChecksStorageConfig raises |
@fedeflowers I left comments as follow-up requests from @grusin-db; at least 2 of them are valid. Using Pydantic for serialization can be done in a follow-up PR. I kicked off the test suite, and it seems some tests are failing. Please check.
Changes
Migrate DQX data models from dataclasses to Pydantic v2 for automatic
validation and simpler YAML serialization/deserialization.
Migrated to pydantic.BaseModel:
- checks_validator.py: ChecksValidationStatus (errors held in a real aliased field so it stays constructor-settable)
- config.py: 7 storage-config classes; __post_init__ → @model_validator(mode="after") with type(self) is ... guards for the multiple-inheritance composite; cached_property → property
- rule.py: DQRule, DQRowRule, DQDatasetRule, DQForEachColRule → frozen Pydantic models (ConfigDict(frozen=True, arbitrary_types_allowed=True) for Spark Callable/Column fields); class constants marked ClassVar
- pyproject.toml: add pydantic>=2.8.2

Intentionally left as dataclasses: WorkspaceConfig, RunConfig, and the nested config tree, because databricks-labs-blueprint's Installation.load() relies on dataclasses.is_dataclass() / __dataclass_fields__, which Pydantic models don't satisfy.
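The dataclass check mentioned above can be reproduced in isolation. This is a small sketch with two hypothetical classes; it does not call databricks-labs-blueprint and only shows what dataclasses.is_dataclass() and __dataclass_fields__ report for each kind of model.

```python
import dataclasses

from pydantic import BaseModel


@dataclasses.dataclass
class DataclassConfigSketch:
    """Hypothetical config kept as a standard-library dataclass."""

    name: str = "default"


class PydanticConfigSketch(BaseModel):
    """Hypothetical config migrated to a Pydantic BaseModel."""

    name: str = "default"


# What dataclass-based dispatch would see for each class.
dataclass_detected = dataclasses.is_dataclass(DataclassConfigSketch)
pydantic_detected = dataclasses.is_dataclass(PydanticConfigSketch)
pydantic_has_fields_attr = hasattr(PydanticConfigSketch, "__dataclass_fields__")
```

Only the dataclass passes the check, which is consistent with keeping WorkspaceConfig and RunConfig as dataclasses for now.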
Linked issues
Resolves #467
Tests