Reproducibility fixes, metric layer, and a data-quality audit #2
Open
ErenAta16 wants to merge 14 commits into
This PR targets the public snapshot of the repository and groups a set of
reproducibility-oriented changes into small, reviewable pieces. Where it modifies existing
pipeline files, it may overlap with a newer local version; those parts can be cherry-picked
as appropriate.
Summary
The released code did not run as published, and the four evaluation metrics were not
included. This PR restores runnability, adds a self-contained metric layer with tests, and
provides tooling to audit the released dataset and to run a small end-to-end reproduction.
Changes
Runnability and security
.env.example and .gitignore.
entity_extraction import, an undefined path, and the shared analysis.py helper that the
analysis_* modules depend on.
Metric layer (new, self-contained)
metrics.py, data_loading.py, and run_metrics.py compute the four metrics (granularity,
diversity, culture specificity, culture consensus) directly from the released dataset,
following the paper definitions and the metric type exclusions.
tests/ contains unit checks and a Figure 2 reproduction. The Figure 2 consensus value (0.5)
is not derivable from the entity sets printed in the figure, which yield a Jaccard of 0.25;
that single case is marked xfail rather than forced, and the assertion can be updated if the
intended sets differ.
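For reviewers who want to check the Jaccard arithmetic behind that xfail case, here is a minimal sketch. It is not the code in metrics.py. The `jaccard` helper, the handling of empty sets, and the two entity sets are assumptions made for illustration, and the sets are deliberately not the ones printed in Figure 2.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Assumption: two empty sets score 0.0; the paper may define this case differently.
        return 0.0
    return len(a & b) / len(union)


# Hypothetical entity sets, NOT the ones printed in Figure 2.
culture_a = {"tea", "coffee"}
culture_b = {"tea", "juice", "water"}

# One shared entity out of four distinct entities gives 1/4.
score = jaccard(culture_a, culture_b)
print(score)
```

With sets of this shape, a value of 0.5 would need, for example, one shared entity out of two distinct ones. In a pytest suite, a case like this can be decorated with `pytest.mark.xfail(reason=...)` so the mismatch stays visible without failing the run.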
Fidelity and quality tooling
validate_fidelity.py re-derives selected paper observations from the released data.
quality_report.py produces a data-quality report covering missing-QID rate, language
mismatch, degenerate generations, and surface-form/QID fragmentation.
smoke_generation.py and compare_to_published.py run a small end-to-end reproduction via
the Together API.
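Two of the checks named for quality_report.py, the missing-QID rate and degenerate generations, can be illustrated with the rough sketch below. It does not reproduce the script itself. The record layout, the `topic`, `entity`, and `qid` field names, the placeholder QID string, and both helper functions are hypothetical, and the sample records are made up.

```python
from collections import Counter, defaultdict


def missing_qid_rate_by_topic(records, topic_key="topic", qid_key="qid"):
    """Fraction of records per topic whose QID is absent or empty."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for record in records:
        topic = record.get(topic_key)
        totals[topic] += 1
        if not record.get(qid_key):  # None, "" or a missing key counts as missing
            missing[topic] += 1
    return {topic: missing[topic] / totals[topic] for topic in totals}


def top_value_share(values):
    """Share of the single most frequent value; values near 1.0 hint at degenerate output."""
    counts = Counter(values)
    if not counts:
        return 0.0
    _, top_count = counts.most_common(1)[0]
    return top_count / sum(counts.values())


# Made-up records using hypothetical field names and a placeholder QID.
records = [
    {"topic": "beverage", "entity": "tea", "qid": "Q_EXAMPLE"},
    {"topic": "beverage", "entity": "chai", "qid": "Q_EXAMPLE"},
    {"topic": "food", "entity": "soup", "qid": None},
    {"topic": "food", "entity": "soup", "qid": "Q_EXAMPLE"},
]

rates = missing_qid_rate_by_topic(records)
share = top_value_share(r["entity"] for r in records)
print(rates, share)
```

Keeping the rate per topic rather than as one global number makes differences between topics visible.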
Documentation and hygiene
requirements.txt, a license, a README reproduction section, docs/DEVIATIONS.md, and a
results summary with figures in docs/RESULTS.md.
Findings (details in docs/RESULTS.md)
from the released data: a single title accounts for 99.7% of generations.
topics (beverage, transportation) and higher for others (clothing, book, food, music).
with Appendix B.
Open question
Does the Hugging Face dataset correspond to the exact version used for the paper, or to an
earlier snapshot? Several measurements above — for example the per-topic missing-QID rates
and an N of 374 rather than 500 in the DeepSeek / United States / books cell — suggest the
released data may differ from the paper run, which would account for the gaps.
Notes
count appears in the dataset card.
Together Turbo serving endpoint and a proxy extraction model, so its comparison is
directional only.