feat(signals): signal extraction with calibrated confidence and manipulation flags#3
Merged
Merged
Conversation
…action Land the binding data contracts between every layer of the extraction pipeline — MarketSnapshot, FeatureVector, MarketSignal, SignalContext, RelatedMarketState — plus the closed enums SignalType, ManipulationFlag, ConsumerType, and InterpretationMode. Field sets mirror docs/contracts/schema-and-versioning.md verbatim; all models are frozen and reject unknown fields so a producer cannot silently add a member that a consumer does not recognize. MarketSignal carries a model_validator that rejects any instance whose raw_features lacks a non-empty calibration_provenance string. This is the project-wide invariant from the development plan (§7.2): no uncalibrated signal escapes the producer. Tests cover both the missing-key and empty-string cases. uuid7-based signal identifiers preserve time ordering, which lets the bus and storage layers sort by identifier and still recover temporal order — a prerequisite for byte-identical backtest replay. scripts/export_schemas.py registers the four shipped models and emits their JSON schemas to schemas/*.json with deterministic key ordering. The --check mode is now load-bearing: a modification to any model shape fails CI until the export is regenerated.
Introduce the ingestion seam: AbstractPoller (two concrete implementations — Polymarket and Kalshi), the RawMarketData / RawOrderBook / RawTrade DTOs that pollers emit, and the normalizer that turns those DTOs into the canonical MarketSnapshot. All platform-specific field mapping lives in the pollers and the normalizer. PolymarketPoller and KalshiPoller each wrap an aiohttp.ClientSession and route every request through the shared exponential-backoff helper (initial 1 s, cap 60 s, max 5 retries per docs/architecture/adaptive-polling-spec.md §Backoff Policy). The KalshiPoller fails loud at construction when KALSHI_API_KEY is absent rather than surfacing a credential error on first call. The normalizer (1) verbatim-preserves question, resolution_source, resolution_criteria, (2) computes spread from bid/ask when both present, (3) derives liquidity from top-5 levels of the order book, (4) stamps schema_version. MalformedPayloadError is raised on missing required keys; there is no silent default. Tests cover the retry success / retry / exhaust paths (with injectable sleep so the suite runs under 100 ms), normalization of both Polymarket and Kalshi shaped payloads, and the missing-price failure mode.
The scheduler is the state machine in docs/architecture/adaptive-polling-spec.md: each market sits in one of four tiers (hot, warm, cool, cold) with asymmetric promotion and demotion thresholds against volume_ratio_1h. The hysteresis bands (±10 % around each switch point) prevent a market sitting near a threshold from flapping between tiers on consecutive ticks, which would corrupt rolling-window features whose semantics depend on consistent temporal sampling. The tier enum drives the polling interval via interval_seconds; a downstream poller loop reads the current tier and schedules the next tick accordingly. Markets closing within 24 h promote from cool to warm regardless of volume; an active recent signal promotes from warm to hot. Rate-limit pressure is fed back via observe_platform_pressure; above 80 % utilization the scheduler emits a RateLimitPressureEvent and demotes the hot market with the lowest volume_ratio_1h so the platform regains headroom without starving the most-active markets. PollingConfig, PollingBody, HysteresisBands, PlatformCaps, and BackoffSettings are frozen Pydantic models mirroring the TOML schema in config/polling.toml. Tests cover the full promotion / demotion chain, the hysteresis band, closes-within-24h promotion, and the rate-limit demotion path.
… ewma The pipeline sits between normalized snapshots and the detectors, producing a FeatureVector per market per tick. Per-market state includes a bounded SnapshotBuffer (default 500 snapshots, ~4 hours at 30 s polling), an EWMA baseline over volume_24h with alpha 0.05, and a rolling estimate of the polling interval. Window semantics follow docs/architecture/adaptive-polling-spec.md §Wall-Clock vs Observation-Count Window Reconciliation. Windows are stored as observation counts internally and the 5m / 15m / 1h / 4h labels are derived at compute time from the current polling-interval estimate. When a market changes polling tier, the window size recomputes next tick — the feature is "volatility of the samples we have", not of an unobserved continuous process. EWMA updates are halt-aware: when the gap since the last observation exceeds 2x the expected interval, the decay multiplier applies (1 - alpha)^gap_factor so the baseline does not freeze through the gap. This prevents a mid-window halt from masking the post-halt volume surge. Indicators are pure functions taking a snapshot sequence and returning float | None for underdetermined cases; no hidden state. Tests cover idempotency (same buffer, same vector), warmup behavior, the EWMA halt-decay path, and the boundary behavior of bid/ask and spread when one side is missing.
…face SignalDetector is the protocol every detector implements: warmup, ingest(market, feature, snapshot, now), state_dict / load_state for checkpointing, and reset. The ingest signature makes now a parameter rather than reading datetime.now(), which keeps backtest replay bit-for-bit identical to live runs — a prerequisite for calibration fidelity per docs/methodology/calibration-methodology.md. DetectorRegistry dispatches per-market detectors observation-at-a-time and batch detectors (cross-market divergence) across the whole polling cycle. Registration is explicit so the engine composes exactly the set of detectors the configuration enables. DetectorsConfig composes five per-detector sub-models — PriceVelocity, VolumeSpike, BookImbalance, CrossMarket, RegimeShift — mirroring the block shape in config/detectors.toml. Every sub-model carries its resolution_exclusion_seconds default (21600 = 6 h) so the pre- resolution-window invariant is enforced uniformly.
…a bocpd Land the price-velocity detector per the method description in docs/methodology/calibration-methodology.md. The change-point model is Bernoulli-Beta BOCPD (Adams & MacKay 2007) on a binary projection of each price observation against a per-market EWMA of recent prices — sustained deviations drive the posterior run-length distribution's mass onto the short-run bucket, producing a sharp signal precisely when the rate of up-ticks vs down-ticks changes. The run-length distribution is capped at run_length_cap and mass that would otherwise fall off the cap is absorbed back into the cap bucket with a weighted average of the sufficient statistics. Without the absorption, long steady-state streams leak probability mass and produce a slow drift in P(r_t < 5). PriceVelocityDetector enforces three operational invariants inside ingest: (1) the 6 h pre-resolution exclusion — signals inside the window are never returned regardless of posterior magnitude, (2) a per-market cooldown so the same underlying change does not fire repeatedly, (3) a 50-observation warmup so the early run-length distribution (trivially concentrated at r = 0) does not produce spurious signals on a fresh market. liquidity_tier.banding provides the per-snapshot tier estimator against the tier thresholds in docs/foundations/glossary.md. Tests cover the BOCPD math invariants (constant stream drives P(r_t < 5) below the 0.3 noise floor, step change fires within 50 observations, out-of-range observation raises), the detector-level behaviors (pre-resolution exclusion, boundary prices, state round trip, reset), and the flat-stream no-signal property.
…hift detectors Three per-market detectors ship together because they share the stateful-per-market / pre-resolution-excluded / warmup-gated shape and exercise the same dispatch path through the registry. Volume spike maintains per-market EWMA mean and variance of volume_ratio_1h (alpha 0.05 by default). Signals fire when the raw z-score exceeds the configured minimum_z and the absolute 24 h volume is above the per-market floor. The FDR controller layer composes over these raw z-scores once the calibration module lands; the detector itself does not gate on FDR because the controller needs a batch of candidates across markets, not a per-market decision. Book imbalance applies a depth gate before the ratio check so signals do not fire on thin books where an imbalance is more likely a manipulation artifact than a directional view (docs/methodology/manipulation-taxonomy.md §thin_book_during_move). The persistence requirement (default 3 consecutive snapshots) filters transient imbalances. Regime shift uses a two-sided CUSUM on volatility_1h with a per-market dormancy gate. The detector only fires after the dormancy window has elapsed since the last signal (or since initialization), so a sustained increase in volatility following a quiet window is what trips the detector. An adaptive cooldown multiplies the dormancy window by the configured factor after each firing, preventing the same regime from emitting signals repeatedly as volatility continues to rise. All three enforce the 6 h pre-resolution exclusion inside ingest, thread ``now`` as a parameter (no ``datetime.now()`` calls), and populate the calibration_provenance stamp so emitted signals satisfy the MarketSignal model validator.
…onitor, and cross-market detector The calibration layer ships four cooperating modules and a dependent cross-market detector that exercises them end to end. benjamini_hochberg implements the step-up procedure: sort ascending, find the largest rank k with p_(k) ≤ (k/m) q, accept all hypotheses whose p-value is at most p_(k). FDRController wraps the procedure and returns the set of signal_ids that pass when a detector submits a batch of (signal_id, p_value) pairs per polling cycle. This is the shared primitive the cross-market divergence detector and (post-Phase 1) the volume-spike detector gate on. ReliabilityAnalyzer looks up curves by (detector_id, liquidity_tier) and linearly interpolates raw scores onto the cached decile grid. The identity placeholder curve (raw == calibrated, version identity_v0) is returned whenever no empirical curve has been registered yet so signals produced during the warmup period still satisfy the calibration_provenance invariant on MarketSignal. EmpiricalFPR computes FP / (FP + TN) against a labeled event stream per docs/methodology/labeling-protocol.md §True Positive with a 24 h default lead window. The contract is functional today against synthetic labels; real populations wait on the labeling workstream. DriftMonitor computes Population Stability Index and a two-sample Kolmogorov-Smirnov statistic over baseline vs current score populations. The nightly calibrate run invokes the monitor and emits a CalibrationStaleEvent when either metric crosses its threshold. CrossMarketDivergenceDetector operates on batches across the full polling cycle because the FDR controller needs to see every candidate market pair's p-value simultaneously. For each curated related-market pair, the detector computes the current Spearman rho, applies the Fisher-z transform, compares the delta-z to the pair's historical z, and emits a signal when BH-FDR accepts the p-value at the target q. Twelve tests cover BH correctness (ordering invariants, empty input, q validation), FDR controller behavior, reliability identity and registered-curve paths, liquidity tier banding, empirical FPR for true-positive and unlabeled cases, drift monitor triggering on a clear distribution shift vs silence on stable scores, and the cross-market detector firing on a decorrelation event.
…ode metadata The manipulation detector attaches flags to every candidate signal before it reaches the bus. Five pure-function signatures, each authoritative in docs/methodology/manipulation-taxonomy.md, drive the decision: single_counterparty_concentration (Herfindahl on trade volume), size_vs_depth_outlier (one trade consumed > threshold of prior depth), cancel_replace_burst (cancel / replace events exceed threshold in a rolling window), thin_book_during_move (median depth below floor), and pre_resolution_window (signal fired inside the 6 h pre-close exclusion). ManipulationDetector runs every signature against the trades, book events, and snapshots surrounding a candidate signal and returns the matched flags. The list is always present and always a list — never None. The detector does not suppress; consumer policy applies suppression per the taxonomy doc. attach_flags produces a new MarketSignal with the flags set, routed through model_copy so Pydantic re-validates the calibration_provenance invariant. CURATED_EPISODES enumerates five canonical historical cases used as positive-case test fixtures with their expected flag sets; the test suite verifies that the flag coverage across episodes spans the full ManipulationFlag enum, preventing the episode catalogue from drifting out of sync with the taxonomy.
… features, signals The storage layer persists the full pipeline output — snapshots, features, signals, manipulation flags, empirical FPR records, and reliability curves — to a single DuckDB database. The schema mirrors docs/architecture/system-design.md §Storage Schema exactly so queries against the backtest harness produce the same shape the live engine writes. initialize is idempotent: it applies the migration statements in order and stamps the schema_version table. The CREATE TABLE IF NOT EXISTS form lets repeat initializations run without error; the schema_version row uses INSERT OR IGNORE so the applied_at timestamp is written only once. Future migrations append to a numbered list and rerun on startup; no destructive migrations are expected inside a major version. Insert paths accept the frozen Pydantic models directly and serialize JSON payloads (raw_json on snapshots, raw_features on signals, manipulation_flags when non-empty). Read paths return typed model instances by routing rows through Pydantic's model_validate so the calibration_provenance invariant still holds on recovery. Storage is deliberately kept synchronous in this implementation; the engine serializes writes through one connection. The async facade in the scaling workstream drops in with the same method surface.
…storm controller Four cooperating modules between the detector layer and the context assembler, all bounded by the semantics in docs/architecture/deduplication-and-storms.md. InProcessAsyncBus fans published signals out to every current subscriber with a per-subscriber bounded queue and a LIFO drop under pressure. This is the Phase-1 implementation; later phases swap in a NATS / Redis Streams adapter behind the same method surface. FingerprintDedup collapses signals that share (market_id, signal_type, time_bucket) with max magnitude, max confidence, union of manipulation_flags and related_market_ids, earliest detected_at, and the lexicographically-smallest signal_id. The merge_provenance entry in raw_features lists every source signal_id so the backtest can reconstruct the pre-dedup stream. ClusterMerge layers on top of FingerprintDedup and merges signals of the same type firing on taxonomy-related markets inside the cluster window. Only strong edges (positive, inverse, causal) contribute to merging — complex and unknown edges are intentionally excluded because cluster merge asserts a shared cause. StormController tracks raw arrival rate and queue depth against asymmetric trigger / recovery thresholds. Entry on either trigger, exit only when both recovery conditions hold for the recovery window. Storm mode is advisory at this layer; the engine consumes StormState to switch the dedup layer to cluster-only output and suspend the LLM formatter.
…lated-market resolver The context assembler wraps every MarketSignal with verbatim platform metadata, the latest snapshots of related markets, and investigation prompts drawn from a frozen curated library. The output (SignalContext) is the binding contract between extraction and downstream formatters: consumers never see the raw MarketSignal. MarketTaxonomy loads curated edges from TOML, stores them bidirectionally (an edge a <-> b is reachable from both markets), and exposes ``cluster_for`` with filtering by relationship type so the dedup layer's cluster merge only considers strong edges (positive, inverse, causal). complex edges are intentionally excluded from clustering because the causal equivalence they assert is too weak. InvestigationPromptLibrary is frozen-at-construction: duplicate entries raise immediately, the public interface is lookup-only, and a coverage report enumerates the (signal_type, category) tuples with no prompts so startup can log the gaps. The library reads from data/investigation_prompts.toml via a classmethod. RelatedMarketResolver fetches the latest snapshot per related market from DuckDB and computes the 24 h price delta. Markets without recent snapshots are omitted; the 1 h freshness window is configurable. ContextAssembler is a pure function of (signal, store, taxonomy, resolver, prompt library, category map). Two invocations with identical inputs produce byte-identical JSON — the determinism test (assemble twice, compare model_dump_json) exercises this invariant and is one of the gates for the extraction workstream's Definition of Done.
…or determinism The Engine composes the full extraction pipeline: snapshot -> feature pipeline -> detector dispatch -> manipulation evaluation -> fingerprint and cluster dedup -> bus publish -> context assembly. run_cycle takes snapshots, the features for each market, the recent trades and book events used by the manipulation detector, and now — threaded through every downstream call so the backtest harness and live engine traverse the same code with bit-for-bit identical timing. The scripts/lint_detector_now.py AST guard parses every file under src/augur_signals/augur_signals/detectors/ and fails non-zero on any direct datetime.now() call — whether via ``datetime.now()`` or ``datetime.datetime.now()``. The guard is wired into both pre-commit (id: datetime-now-in-detectors) and the CI workflow so the invariant holds through every merged commit. tests/signals/test_engine_integration.py replays a synthetic 160-tick snapshot stream (flat phase followed by a sustained level shift) through the engine and asserts at least one SignalContext is emitted. The test exercises the full composition — detectors, manipulation, fingerprint, cluster, bus, assembler — and is the stand-in for the recorded-API-fixture integration test that lands with the labeling workstream.
Cross-market divergence now keys FDR submissions on the pair ``(market_a, market_b)`` rather than just ``market_a``. Before, a market that participated in multiple related-market pairs collapsed to a single per-market pass/fail on the FDR set return — all pair signals for that market survived even when only one pair's p-value crossed. The pair-level key ensures each pair's decision is independent. Cluster-merge representative selection follows the spec: the highest liquidity tier in the cluster wins, ties break alphabetically by market_id. The prior max-magnitude heuristic contradicted docs/architecture/deduplication-and-storms.md §Cluster-Level Merge. DuckDBStore.signals_in_window now rehydrates manipulation flags from the side table before returning. Signals went to storage with their flags persisted but came back with empty flag lists — a silent correctness hazard for backtests that audit manipulation coverage. RelatedMarketResolver's delta window now defaults to 86_400 seconds (24 hours) so ``RelatedMarketState.delta_24h`` matches its name and its contract in docs/contracts/schema-and-versioning.md. The constructor argument is renamed ``delta_window_seconds`` to make the semantic explicit. StormController now requires the queue-depth trigger to sustain across ``trigger_queue_depth_window_sec`` before entering storm mode, matching docs/architecture/deduplication-and-storms.md §Storm Detection. A single-tick depth spike no longer flips the controller. RegimeShiftDetector's direction now compares magnitudes rather than preferring the positive arm; when both arms cross the threshold in the same tick, the dominant excursion's sign is emitted. compute_empirical_fpr now requires ``now`` as a parameter instead of falling back to ``datetime.now()``. This extends the "now as parameter" invariant beyond detectors to every time-sensitive entry point so backtest runs are deterministic across wall clocks. Two new tests round out coverage: a 100-invocation determinism test for ContextAssembler per the contract in §SignalContext, and a round-trip test verifying manipulation flags survive DuckDB write/read.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Delivers the working signal-extraction pipeline end to end: Pydantic data contracts, ingestion adapters, adaptive polling, feature computation, five detectors, manipulation flagging, calibration primitives (BH-FDR, reliability curves, drift monitor, liquidity banding), deduplication, storm handling, deterministic context assembly, DuckDB storage, and an orchestrator that composes the full cycle. Live signal extraction becomes operational against Polymarket and Kalshi once API credentials are provisioned.
What Changed
MarketSnapshot,FeatureVector,MarketSignal,SignalContext,RelatedMarketState, and the four closed enums.MarketSignalenforcesraw_features["calibration_provenance"]via a model validator; all models are frozen and reject unknown fields.AbstractPollerprotocol,PolymarketPollerandKalshiPollerconcrete implementations, exponential-backoff retry, and the normalizer that maps raw payloads ontoMarketSnapshot.SnapshotBuffer, halt-aware EWMA, and pure-function indicators computed over 5m / 15m / 1h / 4h observation-count windows.CURATED_EPISODESmetadata.MarketTaxonomy, frozenInvestigationPromptLibrary,RelatedMarketResolver, and the deterministicContextAssembler.Engine.run_cyclecomposes the full pipeline;scripts/lint_detector_now.pyrejectsdatetime.now()calls inside detector modules and is wired into pre-commit and CI.How It Works
Detector algorithms follow
docs/methodology/calibration-methodology.md. Manipulation signatures matchdocs/methodology/manipulation-taxonomy.md. The polling state machine, deduplication, and storm handling mirror the specs indocs/architecture/adaptive-polling-spec.mdanddocs/architecture/deduplication-and-storms.md. Every layer is bound by the schemas indocs/contracts/schema-and-versioning.md; producers stampschema_versionon every emitted message and consumers validate against the exported JSON schemas.Configuration Added
config/detectors.toml— parameters for the five detectors (hazard rate, alpha/beta priors, fire thresholds, FDR q, dormancy, CUSUM k/h multipliers, depth gates).data/investigation_prompts.toml— curated (signal_type, category) -> prompt entries consumed by the frozenInvestigationPromptLibrary.No new top-level TOML files were added beyond the scaffolding set; existing
config/detectors.toml,config/polling.toml,config/dedup.toml, anddata/investigation_prompts.tomldrive the runtime via the Pydantic configuration models.Schema Changes
Four new JSON schemas exported under
schemas/:MarketSnapshot-1.0.0.jsonFeatureVector-1.0.0.jsonMarketSignal-1.0.0.jsonSignalContext-1.0.0.jsonAll at schema version
1.0.0perdocs/contracts/schema-and-versioning.md §Versioning Policy. No breaking changes; this is the initial publication of the contract.Quality Gates
uv run ruff check .cleanuv run ruff format --check .cleanuv run mypy --strict src/clean (66 source files)uv run pytest --cov-fail-under=80passes; 124 tests, 90.3 % total coverageuv run python scripts/export_schemas.py --check)datetime.now()AST guard passes (uv run python scripts/lint_detector_now.py)uv run pre-commit run --all-filespassesDefinition of Done
src/augur_signals/and its tests pass lint, format, strict mypy.ContextAssembler.assembleon the same input produce byte-identical JSON.ManipulationFlagenum.--checksucceeds.Engine.run_cycleproduces at least oneSignalContextwhen a step change follows a flat warmup window.grepfor LLM imports returns zero matches.datetime.now()in detector modules passes.CHANGELOG.mdupdated with the full module-level additions.Operational Handoff
After merge the engine can run continuously on a single machine, ingest from Polymarket and Kalshi, emit calibrated
MarketSignalevents persisted to DuckDB, and produce deterministicSignalContextenvelopes ready for the formatter workstream. Real reliability curves require labeled events produced by the labeling workstream; during the warmup period every signal carriescalibration_provenance = "<detector>@identity_v0".Test Plan
uv run pytestpasses locally (124 tests).uv run python scripts/export_schemas.py --checkpasses locally.uv run python scripts/lint_detector_now.pypasses locally.uv run pre-commit run --all-filespasses locally.Engine.run_cycleagainst a recorded Polymarket and Kalshi fixture once recorded and confirm a non-empty context stream.Review Pass
pr-reviewfindings addressed: 4 HIGH + 4 MEDIUM fixed on the branch (3784eaa fix(signals): address pr-review findings in extraction core); 1 HIGH (doc-only) deferred; 5 LOW deferred.code-refinersimplifications applied: none actionable. Review completed clean on the simplification axis.Deferred Findings
docs/contracts/schema-and-versioning.md:114typesMarketSignal.raw_featuresasdict[str, float], butcalibration_provenance,merge_provenance,cluster_member_signal_ids, andrelated_market_idall carry string values in the shipped code. The code is correct; the published contract needs adict[str, float | str]update. Will open a separatefix(docs)PR against the same branch before merge or as a follow-up.if utilization > 0.80— cosmetic, deferred.ReliabilityAnalyzer.calibrateis wired for future detector use; the detectors currently emit raw scores with the@identity_v0provenance. This is the documented Phase-1 behavior.merge_provenancevscluster_member_signal_ids) retained because the dedup stages are distinguishable in the audit trail.test_export_schemas.pydoes not separately testRelatedMarketStatebecause it appears nested underSignalContext; already covered transitively.