Add pluggable event, compute, and database backends to modern stack#94
Conversation
|
Important Review skippedAuto reviews are disabled on base/target branches other than the default branch. Please check the settings in the CodeRabbit UI or the You can disable this status message by setting the Use the checkbox below for a quick retry:
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
…lement-event-and-messaging-system-alternatives
fc3287c
into
codex/fix-remaining-issues-and-raise-pr
There was a problem hiding this comment.
Devin Review found 2 new potential issues.
🐛 2 issues in files not directly in the diff
🐛 Duplicate dict keys in architecture() silently discard first distributed_compute and storage entries (pipeline/backend/core/modern_stack.py:195-214)
The architecture() method defines "distributed_compute" twice (lines 195 and 213) and "storage" twice (lines 204 and 214) in the same dict literal. In Python, duplicate keys in a dict literal cause the first value to be silently overwritten by the last one.
Root Cause and Impact
The first "distributed_compute" entry (lines 195-203) contains a hardcoded Ray config with config.alternatives: ["spark", "dask"], and the first "storage" entry (lines 204-212) contains database_alternatives: ["cockroachdb", "mongodb", "cassandra"]. These are silently overwritten by the second definitions at lines 213-214 which use self._distributed_compute() and self._storage() respectively.
The pluggable self._distributed_compute() returns a dict like {"name": "ray", "enabled": True, "config": {"dashboard_url": ..., "entrypoint": ...}} — it does NOT contain config.alternatives.
Similarly, self._storage() returns {"database": {...}, "object_storage": {...}} — it does NOT contain database_alternatives.
This means test_architecture_contains_requested_layers at pipeline/backend/tests/test_modern_stack.py:25-26 would fail at runtime:
architecture["distributed_compute"]["config"]["alternatives"]→KeyErrorbecause the pluggable compute dict doesn't have analternativeskeyarchitecture["storage"]["database_alternatives"]→KeyErrorbecauseself._storage()doesn't have adatabase_alternativeskey
The old hardcoded entries should have been removed when the pluggable self._distributed_compute() and self._storage() calls were added.
🐛 Duplicate compute and event commands emitted in submit_execution() due to old hardcoded logic not being removed (pipeline/backend/core/modern_stack.py:250-276)
submit_execution() appends duplicate distributed_compute and event_layer commands because old hardcoded logic (lines 250-276) was not removed when the new pluggable logic (lines 277-308) was added.
Detailed Explanation
Duplicate compute commands: Lines 250-266 always append a hardcoded compute command — either {"engine": "ray", "action": "submit_ray_job"} if settings.RAY_ENABLED is true, or {"engine": "spark", "action": "submit_spark_job"} otherwise. Then lines 277-290 call self._distributed_compute() and append another compute command based on the DISTRIBUTED_COMPUTE_BACKEND setting. With default settings (RAY_ENABLED=True, DISTRIBUTED_COMPUTE_BACKEND="ray"), the commands list will contain two nearly identical ray compute commands.
Worse, when the user configures DISTRIBUTED_COMPUTE_BACKEND="dask" but RAY_ENABLED=True (the default), the old code still emits a ray command while the new code correctly emits a dask command — resulting in contradictory commands.
Duplicate event commands: Lines 268-276 append a hardcoded kafka event command when settings.KAFKA_ENABLED is true. Then lines 292-308 call self._event_layer() and append another event command based on EVENT_BACKEND. With KAFKA_ENABLED=True and EVENT_BACKEND="kafka" (the default), two kafka event commands are emitted.
The old hardcoded blocks at lines 250-276 should have been removed when the pluggable blocks at lines 277-308 were added.
View 9 additional findings in Devin Review.
Motivation
Description
pipeline/backend/config.pyto allow runtime selection of eventing (EVENT_BACKEND+ Pulsar/RabbitMQ/NATS settings), distributed compute (DISTRIBUTED_COMPUTE_BACKEND+ Spark/Dask settings), and database backend (DATABASE_BACKEND+ CockroachDB/MongoDB/Cassandra settings).ModernOrchestrationStackinpipeline/backend/core/modern_stack.pyto resolve providers through helper methods._event_layer(),._distributed_compute(), and._storage()and to use those resolved components inarchitecture()andsubmit_execution()to generate provider-specific command metadata.pipeline/backend/tests/test_modern_stack.pyto validate selection and command generation for alternative backends (Pulsar+Spark+CockroachDB and RabbitMQ+Dask) usingmonkeypatchto override settings.Testing
pytest -q pipeline/backend/tests/test_modern_stack.pyand all tests passed (4 passed).architecture()andsubmit_execution()for default and alternative backend combinations and succeeded without regressions.Codex Task