Reproducibility fixes, metric layer, and a data-quality audit#2

Open
ErenAta16 wants to merge 14 commits into mainlp:main from ErenAta16:chore/metrics-reproducibility

@ErenAta16 commented Jun 23, 2026

This PR targets the public snapshot of the repository and groups a set of
reproducibility-oriented changes into small, reviewable pieces. Where it modifies existing
pipeline files, it may overlap with a newer local version; those parts can be cherry-picked
as appropriate.

Summary

The released code did not run as published, and the four evaluation metrics were not
included. This PR restores runnability, adds a self-contained metric layer with tests, and
provides tooling to audit the released dataset and to run a small end-to-end reproduction.

Changes

Runnability and security

  • API keys are read from the environment instead of being hard-coded; adds .env.example
    and .gitignore.
  • Fixes the import and syntax errors that prevented the pipeline from importing: the
    entity_extraction import, an undefined path, and the shared analysis.py helper that the
    analysis_* modules depend on.
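
The environment-based key lookup from the first bullet could look roughly like the sketch below. The variable name TOGETHER_API_KEY and the helper load_api_key are assumptions for illustration; the PR may use different names. Loading a .env file into the environment is left to the shell or a separate loader.

```python
import os


def load_api_key(var_name: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in the environment variable `var_name`.

    The default variable name is a hypothetical placeholder.
    """
    key = os.environ.get(var_name, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(
            f"{var_name} is not set; copy .env.example to .env and fill it in, "
            "or export the variable in your shell."
        )
    return key
```

Failing at lookup time keeps a missing key from surfacing later as an opaque authentication error.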

Metric layer (new, self-contained)

  • metrics.py, data_loading.py, and run_metrics.py compute the four metrics
    (granularity, diversity, culture specificity, culture consensus) directly from the released
    dataset, following the paper definitions and the metric type exclusions.
  • tests/ contains unit checks and a Figure 2 reproduction. The Figure 2 consensus value
    (0.5) is not derivable from the entity sets printed in the figure, which yield a Jaccard of
    0.25; that single case is marked xfail rather than forced, and the assertion can be
    updated if the intended sets differ.
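
For reference, a Jaccard value of 0.25 arises whenever two sets share one element out of four distinct ones. The sketch below uses placeholder entity names, not the sets printed in Figure 2, and the empty-set convention is an assumption rather than something taken from the paper.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|.

    Returning 0.0 when both sets are empty is an assumed convention.
    """
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder entity sets, NOT the ones printed in Figure 2.
set_a = {"e1", "e2"}
set_b = {"e2", "e3", "e4"}

score = jaccard(set_a, set_b)  # 1 shared entity out of 4 distinct ones
print(score)  # 0.25
```

Reaching 0.5 with four distinct entities would require two shared ones, which is why the printed sets and the reported value cannot both be right under this definition.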

Fidelity and quality tooling

  • validate_fidelity.py re-derives selected paper observations from the released data.
  • quality_report.py produces a data-quality report covering missing-QID rate, language
    mismatch, degenerate generations, and surface-form/QID fragmentation.
  • smoke_generation.py and compare_to_published.py run a small end-to-end reproduction via
    the Together API.
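
As an illustration of one check in the quality report, a per-topic missing-QID rate can be computed as below. This is a sketch on toy records; the field names "topic" and "qid" and the function missing_qid_rate are hypothetical and may not match quality_report.py or the released dataset schema.

```python
from collections import defaultdict

# Toy records with hypothetical field names; not real data.
records = [
    {"topic": "beverage", "qid": "Q-placeholder"},
    {"topic": "beverage", "qid": None},
    {"topic": "food", "qid": ""},
    {"topic": "food", "qid": None},
]


def missing_qid_rate(rows):
    """Return {topic: fraction of rows whose QID is None or empty}."""
    total = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        total[row["topic"]] += 1
        if not row.get("qid"):
            missing[row["topic"]] += 1
    return {topic: missing[topic] / total[topic] for topic in total}


rates = missing_qid_rate(records)
print(rates)  # {'beverage': 0.5, 'food': 1.0}
```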

Documentation and hygiene

  • Adds requirements.txt, a license, a README reproduction section, docs/DEVIATIONS.md,
    and a results summary with figures in docs/RESULTS.md.

Findings (details in docs/RESULTS.md)

  • The mode-collapse observation for DeepSeek (English / books / United States) reproduces
    from the released data: a single title accounts for 99.7% of generations.
  • Restricted to cultural entity types, the missing-QID rate is close to the values reported
    in Table 10 for some topics (beverage, transportation) and higher for others (clothing,
    book, food, music).
  • Language mismatch is concentrated in Qwen and Mistral on non-English prompts, consistent
    with Appendix B.
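
The mode-collapse figure above is the share of generations taken by the single most frequent title. A minimal sketch of that calculation on toy input follows; the function name top_share and the example titles are illustrative, not taken from the PR or the released data.

```python
from collections import Counter


def top_share(generations):
    """Return (most common item, its share of all generations)."""
    if not generations:
        raise ValueError("no generations to summarise")
    item, count = Counter(generations).most_common(1)[0]
    return item, count / len(generations)


# Toy input, not the released data: 9 of 10 generations are the same title.
toy = ["Title A"] * 9 + ["Title B"]
title, share = top_share(toy)
print(title, f"{share:.1%}")  # Title A 90.0%
```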

Open question

Does the Hugging Face dataset correspond to the exact version used for the paper, or to an
earlier snapshot? Several measurements above (for example, the per-topic missing-QID rates
and an N of 374 rather than 500 in the DeepSeek / United States / books cell) suggest the
released data may differ from the paper run, which would account for the gaps.

Notes

  • The README correction updates "13 LLMs" to "7 LLMs" to match the paper (Table 1). The same
    count appears in the dataset card.
  • The smoke reproduction is a pipeline-integrity check rather than a faithful run: it uses a
    Together Turbo serving endpoint and a proxy extraction model, so its comparison is
    directional only.

@ErenAta16 changed the title from "Add metric reproducibility tools and fidelity checks" to "Reproducibility fixes, metric layer, and a data-quality audit" on Jun 24, 2026
@ErenAta16 force-pushed the chore/metrics-reproducibility branch from 93e63e1 to 36759c4 on June 24, 2026 16:05