refactor(dedup): retire defensive keep="last" on ArcticDB reads (L2188)#206

Merged
cipher813 merged 1 commit into main from refactor/retire-defensive-dedup-l2188 on May 28, 2026

@cipher813
Owner

Summary

Remove the defensive `keep="last"` index dedup from both the training read path (`data/dataset.py::_load_ticker_parquet`) and the inference read path (`inference/stages/load_prices.py::_read_ohlcv` + `_read_close`). Add source-level pin tests so the dead defense can't silently reappear.

Why

The dedup was a 2026-04-15 transition device added to mask same-date duplicate rows ArcticDB was emitting at the time. The upstream fix (alpha-engine-data `builders/daily_append.py` → `update()` over `append()`) shipped the same day, and 6+ weeks of clean Saturday + weekday cycles have verified the write path is now duplicate-free.

Per [[feedback_no_silent_fails]]: a silent `keep="last"` on duplicates whose values differ would mask an upstream write-path regression as a clean read. The right surface for any future recurrence is the downstream pandas reindex raising, which fails the pipeline loudly at the first `compute_features` call site.
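The difference between the two failure modes can be sketched with a toy frame. This is a minimal illustration, not code from the repository: the `prices` frame, its values, and the `target` calendar are made up, and the real failure would surface inside `compute_features` rather than in a bare `reindex` call.

```python
import pandas as pd

# Toy price frame with a same-date duplicate whose values differ,
# the shape of regression the retired dedup used to hide.
idx = pd.to_datetime(["2026-04-14", "2026-04-15", "2026-04-15"])
prices = pd.DataFrame({"close": [100.0, 101.0, 250.0]}, index=idx)

# Old behaviour: the defensive dedup silently keeps the last row per date.
deduped = prices[~prices.index.duplicated(keep="last")]

# New behaviour: no dedup, so the first reindex on the duplicated index raises.
target = pd.date_range("2026-04-14", "2026-04-16", freq="D")
try:
    prices.reindex(target)
    raised = False
except ValueError as exc:
    raised = True
    print(f"loud failure: {exc}")
```

With the dedup in place the read looks clean and the 101.0 row disappears without a trace; without it, the duplicate stops the run at the first reindex.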

ROADMAP: L2188, Phase 7a follow-up: remove defensive `keep="last"` dedup.

Out of scope

`data/dataset.py:215` retains `.duplicated(keep="first")` for a different purpose (cross-source merge first-wins semantics, not defensive dedup). Intentionally left alone.
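For contrast, first-wins semantics on a cross-source merge might look like the sketch below. The `primary` and `secondary` frames and the use of `pd.concat` are assumptions for illustration; the actual merge at `data/dataset.py:215` is not shown in this PR.

```python
import pandas as pd

# Hypothetical sources that overlap on 2026-04-15.
primary = pd.DataFrame(
    {"close": [10.0, 11.0]},
    index=pd.to_datetime(["2026-04-14", "2026-04-15"]),
)
secondary = pd.DataFrame(
    {"close": [99.0, 12.0]},
    index=pd.to_datetime(["2026-04-15", "2026-04-16"]),
)

# Concatenate with the preferred source first, then keep the first row per date,
# so the primary source wins wherever the two overlap.
merged = pd.concat([primary, secondary])
merged = merged[~merged.index.duplicated(keep="first")].sort_index()
print(merged)
```

Here the overlap is expected and the precedence rule is deliberate, which is why this call is a merge policy rather than a defense against bad writes.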

Test plan

  • Source-level pins (TestDefensiveDedupRetiredL2188) block re-introduction
  • Full suite green: 1216 passed
  • Next Sat 5/30 PredictorTraining-in-SF — first organic exercise on the simplified read path
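A source-level pin of this kind could be sketched as follows. The PR's actual test bodies are not shown, so this is an assumption about the approach: the helper `find_dedup_calls` and both sample readers are hypothetical, and the sketch walks the AST rather than matching the `.duplicated(` substring so that a comment mentioning the call does not trip the check.

```python
import ast

def find_dedup_calls(source: str, banned=("duplicated", "drop_duplicates")) -> list[str]:
    """Return the names of banned method calls found in the given source text."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr in banned
        ):
            found.append(node.func.attr)
    return found

# Hypothetical reader sources: one clean, one with the retired defense re-added.
CLEAN_READER = '''
def read_close(frame):
    return frame["close"].sort_index()
'''

REGRESSED_READER = '''
def read_close(frame):
    frame = frame[~frame.index.duplicated(keep="last")]
    return frame["close"].sort_index()
'''

print(find_dedup_calls(CLEAN_READER))
print(find_dedup_calls(REGRESSED_READER))
```

In a real test the source text would typically come from `inspect.getsource` on the function being pinned, with the test asserting that the returned list is empty.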

🤖 Generated with Claude Code

Remove the defensive `keep="last"` index dedup from both the training
read path (`data/dataset.py::_load_ticker_parquet`) and the inference
read path (`inference/stages/load_prices.py::_read_ohlcv` + `_read_close`).

The dedup was a 2026-04-15 transition device added to mask same-date
duplicate rows ArcticDB was emitting at the time. The upstream fix
(alpha-engine-data builders/daily_append.py → `update()` over `append()`)
shipped same day, and 6+ weeks of clean Saturday + weekday cycles have
verified the write path is now duplicate-free.

Per feedback_no_silent_fails: a silent `keep="last"` on values-differ
duplicates would mask an upstream write-path regression as a clean read.
The right surface for any future recurrence is the downstream pandas
reindex raising, which fails the pipeline loudly at the first feature
computation site (`compute_features`).

Tests:
- `TestDefensiveDedupRetiredL2188::test_dataset_loader_does_not_dedup`
  — source-level pin that `_load_ticker_parquet` carries no `.duplicated(`
  call.
- `TestDefensiveDedupRetiredL2188::test_inference_readers_do_not_dedup`
  — same pin for `load_price_data_from_arctic`.

Note: `data/dataset.py:215` retains `.duplicated(keep="first")` for a
DIFFERENT purpose (cross-source merge first-wins semantics, not
defensive dedup); intentionally out of scope.

Suite: 1216 passed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cipher813 cipher813 merged commit 286c3c7 into main May 28, 2026
1 check passed
@cipher813 cipher813 deleted the refactor/retire-defensive-dedup-l2188 branch May 28, 2026 15:50