
fix(bench): Update DOOP zxing dataset URL to HuggingFace mirror#506

Merged
justinjoy merged 1 commit into main from fix/doop-dataset-url on Apr 16, 2026

Conversation

@justinjoy
Collaborator

Summary

  • The original DOOP zxing dataset host (pages.cs.wisc.edu/~m0riarty) now returns 404 and is no longer maintained, so bench/data/doop/download.sh cannot fetch the dataset.
  • Point the script at the FlowLog VLDB 2026 artifact mirror on HuggingFace (NemoYuu/flowlog_benchmark), which the upstream project now uses for distribution.
  • Harden the script so similar breakage is caught immediately rather than producing misleading "cannot find zipfile directory" failures.

Script hardening

  • Honour DOOP_ZXING_URL to override the source without editing the file.
  • Use curl --fail so HTTP errors abort the script.
  • Validate the archive with unzip -tq before extraction; if the server returned HTML (e.g. a 404 page), surface a clear error.
  • Move the tmpdir/zip cleanup into a trap so partial downloads are removed even on early exit.

Verification

  • Ran bash bench/data/doop/download.sh from a clean state: 34 CSV files (~83 MB) extracted successfully.
  • Full validation pipeline: bash scripts/run_doop_validation.sh --workers 1 --repeat 1 returned OK (6,276,653 output tuples, 28 iterations, 136.3 s wall, ~16.4 GB peak RSS).

Test plan

  • CI fetches the dataset on a build host without hitting the previous 404.
  • DOOP_ZXING_URL=<bad-url> bash bench/data/doop/download.sh fails fast with the new error message rather than an unzip diagnostic.
  • scripts/run_doop_validation.sh still reports PASS against the newly downloaded dataset.

The original host (pages.cs.wisc.edu/~m0riarty) returns 404 and is no
longer maintained.  Point at the FlowLog VLDB 2026 artifact mirror on
HuggingFace (NemoYuu/flowlog_benchmark), which the upstream project
now uses for dataset distribution.

Also harden the script:
- Honour DOOP_ZXING_URL to override the source without editing the file
- Use curl --fail so HTTP errors abort the script
- Validate the archive with unzip -tq before extraction, so a future
  broken mirror surfaces a clear error instead of a misleading
  "cannot find zipfile directory" failure
- Move tmpdir/zip cleanup into a trap so partial downloads are removed
  even when the script exits early
@justinjoy justinjoy merged commit 41bf119 into main Apr 16, 2026
3 checks passed
@justinjoy justinjoy deleted the fix/doop-dataset-url branch April 16, 2026 12:05