init: manifold implementation#192

Open
nikbpetrov wants to merge 6 commits into forecastingresearch:main from nikbpetrov:manifold

@nikbpetrov
Collaborator

nikbpetrov commented May 16, 2026

In addition to the usual read/write patterns updated during this refactor, note the change to the bitwise OR update in search_markets: it avoids having to pass ids around (and reduces the memory footprint, perhaps negligibly).

Note the second commit. During testing I found that the fetch's 30s timeout was too short under high IO on the GCP buckets, so I bumped it to 60s, and then bumped it further to match max_time from backoff.

Parity testing: old and new code match perfectly when run together. Both fetch jobs finish in ~10s, while for the update jobs the old code ran in 26 mins vs 6 mins for the new code (see comment below). Both jobs' outputs differ ever so slightly from last night's prod data in my run (1 new id out of 2130, and 3 slightly different values out of 1981), all within expected deviations for a market source.

Full pipeline test: not done as per Slack discussion

@nikbpetrov
Collaborator Author

nikbpetrov commented May 16, 2026

dbb3f5d is new behaviour (tested with logs against old), but it seems like low-hanging fruit that saves a ton of IO (8 mins of runtime for the update job vs 28 mins in ideal conditions for new code).

Edit: Actually, this was the intended behaviour behind the current prod code too (here), but there seems to be a bug:

# Regenerate resolution files in case they've been deleted
resolved_files = gcp.storage.list_with_prefix(
    bucket_name=env.QUESTION_BANK_BUCKET, prefix=source
)
filename = f"{row['id']}.jsonl"
if filename not in resolved_files:
    market = _get_market(row["id"])
    _create_resolution_file(dfq, index, market)

Here, the entries in resolved_files carry the prefix while filename does not, so the membership check never matches. As a result, a resolution file is always created and uploaded, which causes excessive writes to GCP.
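
A minimal sketch of the mismatch, using a hard-coded list in place of the real bucket listing. The object layout (`<source>/<id>.jsonl`) and the sample ids are assumptions made for illustration; the actual return format of list_with_prefix may differ.

```python
# Hypothetical listing: assumes object names come back with the source prefix.
source = "manifold"
resolved_files = ["manifold/abc123.jsonl", "manifold/def456.jsonl"]
row = {"id": "abc123"}  # made-up market id

# Buggy check: a bare filename is compared against prefixed object names.
bare_name = f"{row['id']}.jsonl"
buggy_needs_regeneration = bare_name not in resolved_files

# One possible fix: build the name with the same prefix before comparing.
# A set also makes each membership test O(1) when looping over many rows.
resolved_set = set(resolved_files)
prefixed_name = f"{source}/{row['id']}.jsonl"
fixed_needs_regeneration = prefixed_name not in resolved_set

print(buggy_needs_regeneration, fixed_needs_regeneration)
```

With the bare name, the file is reported as missing even though it exists, which matches the always-regenerate behaviour described above. Stripping the prefix from the listed names instead would work equally well.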

nikbpetrov requested a review from houtanb on May 16, 2026 18:49
@houtanb
Member

houtanb commented May 18, 2026

@codex review

chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: f2a49e7752

Comment thread src/sources/manifold.py
    existing_df=existing_df,
)
if df_res is not None:
    resolution_files[row["id"]] = df_res

P2: Avoid uploading unchanged resolution files

When _build_resolution_file returns existing_df for an already up-to-date unresolved market, this line adds that unchanged DataFrame to resolution_files before the identity check below. As a result, func_manifold_update still passes it to _source_io.upload_resolution_files, so nightly runs where most Manifold files are current rewrite every unresolved resolution file despite logging that the write was skipped, increasing job time and bucket writes. Only add the file in the df_res is not existing_df branch.
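
The suggested change could look roughly like the sketch below. It is not the PR's code: `collect_changed` and `fake_build` are hypothetical names, `fake_build` stands in for _build_resolution_file, and plain lists stand in for DataFrames so that the example runs without pandas. The identity check (`is`) behaves the same way for DataFrames.

```python
def collect_changed(rows, existing, build):
    """Collect only resolution files that differ from the stored object."""
    resolution_files = {}
    for row in rows:
        existing_df = existing.get(row["id"])
        df_res = build(row, existing_df)
        if df_res is None:
            continue  # nothing could be built for this market
        if df_res is existing_df:
            continue  # unchanged: do not queue it for upload
        resolution_files[row["id"]] = df_res
    return resolution_files


def fake_build(row, existing_df):
    # Hypothetical builder: returns the same object when nothing changed,
    # otherwise a new list with the latest value appended.
    if row["up_to_date"]:
        return existing_df
    return [*(existing_df or []), row["value"]]


existing = {"a": [0.4], "b": [0.5]}
rows = [
    {"id": "a", "up_to_date": True, "value": 0.4},
    {"id": "b", "up_to_date": False, "value": 0.6},
]
changed = collect_changed(rows, existing, fake_build)
print(changed)
```

Only market "b" ends up in the dictionary that would be handed to the upload step, so an up-to-date file is never rewritten.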

@houtanb
Member

houtanb commented May 18, 2026

@claude review

@nikbpetrov
Collaborator Author

nikbpetrov commented May 18, 2026

@houtanb I ran the full pipeline (I cancelled resolve and leaderboard after they started, so as not to waste resources, but they start fine).

The polymarket fetch job failed due to a timeout, though it seems to have been fetching data fine. I'm not sure what causes this at this point; I recall some issues with some providers?

Did not run acled as per previous agreement.
