init: manifold implementation #192
dbb3f5d is new behaviour (tested with logs against the old code), but it seems like low-hanging fruit to save a ton of IO (8 mins of runtime for the update job vs 28 mins in ideal conditions for the new code).

Edit: Actually, this was the intended behaviour behind the current prod code too (here), but there seems to be a bug:

    # Regenerate resolution files in case they've been deleted
    resolved_files = gcp.storage.list_with_prefix(
        bucket_name=env.QUESTION_BANK_BUCKET, prefix=source
    )
    filename = f"{row['id']}.jsonl"
    if filename not in resolved_files:
        market = _get_market(row["id"])
        _create_resolution_file(dfq, index, market)

here,
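One way a check like this could misfire, assuming (this is not confirmed in the thread) that the listing returns full object names including the prefix, such as `manifold/abc.jsonl`: a bare `abc.jsonl` would never be found, so every file would be regenerated on each run. A minimal sketch of a prefix-insensitive check follows; `existing_ids` and the sample object names are hypothetical.

```python
from pathlib import PurePosixPath


def existing_ids(object_names, suffix=".jsonl"):
    """Return the ids of listed objects, ignoring any path prefix."""
    return {
        PurePosixPath(name).name[: -len(suffix)]
        for name in object_names
        if name.endswith(suffix)
    }


# Hypothetical listing; real bucket contents are not shown in this thread.
listed = ["manifold/abc.jsonl", "manifold/def.jsonl", "manifold/dfq.json"]
ids = existing_ids(listed)

# A bare filename does not match full object names, but the bare id does
# match the normalised set.
bare_name_found = "abc.jsonl" in listed
id_found = "abc" in ids
```

Building the set once per run also keeps each membership test constant-time instead of scanning a list for every row.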
@codex review
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f2a49e7752
            existing_df=existing_df,
        )
        if df_res is not None:
            resolution_files[row["id"]] = df_res
Avoid uploading unchanged resolution files
When `_build_resolution_file` returns `existing_df` for an already up-to-date unresolved market, this line adds that unchanged DataFrame to `resolution_files` before the identity check below. As a result, `func_manifold_update` still passes it to `_source_io.upload_resolution_files`, so nightly runs where most Manifold files are current rewrite every unresolved resolution file despite logging that the write was skipped, increasing job time and bucket writes. Only add the file in the `df_res is not existing_df` branch.
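The suggestion can be sketched as follows, assuming the builder returns the very same object when nothing changed. `build_resolution_file` and `collect_changed` are hypothetical stand-ins rather than the functions in this PR, and plain lists replace DataFrames to keep the sketch self-contained.

```python
def build_resolution_file(existing, new_rows):
    """Hypothetical stand-in: return `existing` itself when nothing changes."""
    if existing is None and not new_rows:
        return None  # nothing to build
    if existing is not None and not new_rows:
        return existing  # up to date: the same object, not a copy
    return (existing or []) + new_rows


def collect_changed(markets):
    """Collect only the files that were actually rebuilt."""
    resolution_files = {}
    for market_id, (existing, new_rows) in markets.items():
        result = build_resolution_file(existing, new_rows)
        if result is None:
            continue
        if result is not existing:  # identity check, not equality
            resolution_files[market_id] = result
    return resolution_files


markets = {
    "a": ([{"value": 0.4}], []),                # already up to date
    "b": ([{"value": 0.4}], [{"value": 0.5}]),  # has a new row
    "c": (None, [{"value": 0.1}]),              # no existing file yet
}
changed = collect_changed(markets)
```

Here only `b` and `c` end up in `changed`, so an upload step that iterates over it would skip the unchanged file for `a`.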
@claude review
@houtanb I ran the full pipeline (I cancelled resolve and leaderboard after they started so as not to waste resources, but they start fine). The polymarket fetch job failed due to a timeout, though it seems to have been fetching fine up to that point. Not sure what causes this yet; I recall some issues with some providers? Did not run acled, as per our previous agreement.
In addition to the usual read/write patterns being updated during this refactor, note the revised bitwise OR update in `search_markets`: it avoids having to pass `ids` around (and reduces, perhaps negligibly, the memory footprint).

Note the second commit. During testing I found the fetch's 30s timeout was too short under high IO in the GCP buckets, so I bumped it to 60s and then further to match `max_time` from backoff.

Parity testing: old/new code matches perfectly when run together, with both fetch jobs finishing in ~10s, while for update jobs the old code ran in 26 mins vs 6 mins for the new code (see comment below). Both jobs' output differs ever so slightly from last night's prod data in my run (1 new id out of 2130, and 3 slightly different values out of 1981), all within expected deviations for a market source.
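To illustrate the timeout change, here is a minimal sketch, assuming one constant is shared between the per-request timeout and the total retry budget. It uses a hand-rolled retry loop instead of the backoff library so that it stays self-contained; `MAX_TIME_S`, `fetch_with_retries`, `flaky_fetch`, and the values are hypothetical and not taken from this PR.

```python
import time

MAX_TIME_S = 2.0  # hypothetical value; stands in for the retry budget


def fetch_with_retries(fetch, max_time=MAX_TIME_S, base_delay=0.01):
    """Call fetch(timeout=...) until it succeeds or max_time seconds elapse."""
    deadline = time.monotonic() + max_time
    delay = base_delay
    while True:
        try:
            # The per-request timeout equals the overall budget, so a slow
            # request is not cut off before the retry loop would give up.
            return fetch(timeout=max_time)
        except TimeoutError:
            if time.monotonic() + delay > deadline:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff between attempts


calls = []


def flaky_fetch(timeout):
    """Stand-in for a network call: times out twice, then succeeds."""
    calls.append(timeout)
    if len(calls) < 3:
        raise TimeoutError("simulated slow bucket")
    return {"status": "ok"}


result = fetch_with_retries(flaky_fetch)
```

Keeping both limits tied to a single constant means a later change to the retry budget cannot leave a shorter per-request timeout behind.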
Full pipeline test: not done as per Slack discussion
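On the bitwise OR update in `search_markets` mentioned above: the code is not shown in this thread, so the following is only one possible reading, assuming the update is an in-place `|=` on a set of ids. The function body, the `seen` parameter, and the sample pages are hypothetical.

```python
def search_markets(pages, seen=None):
    """Hypothetical sketch: accumulate market ids across result pages."""
    seen = set() if seen is None else seen
    for page in pages:
        # In-place union: duplicates collapse and no separate list of ids
        # has to be threaded through the callers.
        seen |= {market["id"] for market in page}
    return seen


pages = [
    [{"id": "m1"}, {"id": "m2"}],
    [{"id": "m2"}, {"id": "m3"}],  # "m2" repeats across pages
]
ids = search_markets(pages)
```

Because `|=` mutates the existing set rather than building a new one per page, the accumulated ids live in a single object for the whole search.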