Reproducibility fixes, metric layer, and a data-quality audit by ErenAta16 · Pull Request #2 · mainlp/MAKIEval

ErenAta16 · 2026-06-23T16:56:29Z

This PR targets the public snapshot of the repository and groups a set of
reproducibility-oriented changes into small, reviewable pieces. Where it modifies existing
pipeline files, it may overlap with a newer local version; those parts can be cherry-picked
as appropriate.

Summary

The released code did not run as published, and code for the four evaluation metrics was
not included. This PR restores runnability, adds a self-contained metric layer with tests,
and provides tooling to audit the released dataset and to run a small end-to-end reproduction.

Changes

Runnability and security

Reads API keys from the environment instead of hard-coding them; adds .env.example
and .gitignore.
Fixes the import and syntax errors that kept the pipeline from loading: the
entity_extraction import, an undefined path, and the shared analysis.py helper that the
analysis_* modules depend on.
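
A minimal sketch of the environment-based key lookup described above. The variable name TOGETHER_API_KEY and the helper get_api_key are assumptions made for illustration; the names actually used in this PR may differ.

```python
import os


def get_api_key(name: str = "TOGETHER_API_KEY") -> str:
    """Return the API key stored in the environment variable `name`.

    The default variable name is hypothetical; use whatever .env.example lists.
    """
    key = os.environ.get(name, "").strip()
    if not key:
        # Fail early with an actionable message instead of sending an empty key.
        raise RuntimeError(
            f"{name} is not set; copy .env.example to .env and fill it in, "
            "or export the variable in your shell."
        )
    return key
```

Failing at lookup time keeps a missing key from surfacing later as an opaque authentication error inside the pipeline.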

Metric layer (new, self-contained)

metrics.py, data_loading.py, and run_metrics.py compute the four metrics
(granularity, diversity, culture specificity, culture consensus) directly from the released
dataset, following the paper definitions and the metric type exclusions.
tests/ contains unit checks and a Figure 2 reproduction. The Figure 2 consensus value
(0.5) is not derivable from the entity sets printed in the figure, which yield a Jaccard of
0.25; that single case is marked xfail rather than forced, and the assertion can be
updated if the intended sets differ.
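
To make the 0.25 versus 0.5 discrepancy concrete, here is a small sketch of a Jaccard computation. The two sets are hypothetical placeholders, not the entity sets printed in Figure 2, and the function is an illustration rather than the code in metrics.py.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union; 0.0 if both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Hypothetical entity sets, NOT the ones printed in Figure 2.
set_a = {"e1", "e2", "e3"}
set_b = {"e3", "e4"}
score = jaccard(set_a, set_b)  # 1 shared entity out of 4 distinct ones -> 0.25
```

A test asserting 0.5 for sets like these would fail, which is the situation the xfail marker records until the intended sets are confirmed.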

Fidelity and quality tooling

validate_fidelity.py re-derives selected paper observations from the released data.
quality_report.py produces a data-quality report covering missing-QID rate, language
mismatch, degenerate generations, and surface-form/QID fragmentation.
smoke_generation.py and compare_to_published.py run a small end-to-end reproduction via
the Together API.
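
As an illustration of one audit dimension, the sketch below computes a per-topic missing-QID rate. The record layout (dicts with "topic" and "qid" keys) is an assumption and may not match the released dataset or quality_report.py.

```python
from collections import defaultdict


def missing_qid_rate(records):
    """Return {topic: fraction of records without a QID}.

    Assumes each record is a dict with hypothetical keys "topic" and "qid";
    a missing, None, or empty "qid" counts as unlinked.
    """
    total = defaultdict(int)
    missing = defaultdict(int)
    for rec in records:
        topic = rec["topic"]
        total[topic] += 1
        if not rec.get("qid"):
            missing[topic] += 1
    return {topic: missing[topic] / total[topic] for topic in total}


# Toy rows for illustration only, not taken from the released dataset.
rows = [
    {"topic": "food", "qid": "Q1"},
    {"topic": "food", "qid": None},
    {"topic": "music", "qid": ""},
    {"topic": "music", "qid": "Q2"},
    {"topic": "music", "qid": "Q3"},
    {"topic": "music", "qid": "Q4"},
]
rates = missing_qid_rate(rows)  # food: 1 of 2 missing, music: 1 of 4 missing
```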

Documentation and hygiene

Adds requirements.txt, a license, a README reproduction section, docs/DEVIATIONS.md,
and a results summary with figures in docs/RESULTS.md.

Findings (details in docs/RESULTS.md)

The mode-collapse observation for DeepSeek (English / books / United States) reproduces
from the released data: a single title accounts for 99.7% of generations.
Restricted to cultural entity types, the missing-QID rate is close to Table 10 for some
topics (beverage, transportation) and higher for others (clothing, book, food, music).
Language mismatch is concentrated in Qwen and Mistral on non-English prompts, consistent
with Appendix B.
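
The mode-collapse figure is the share of generations taken by the most frequent title within one cell. A minimal sketch with toy data follows; the helper name top_share and the sample titles are hypothetical.

```python
from collections import Counter


def top_share(items):
    """Return (most common item, its share of all items); (None, 0.0) if empty."""
    counts = Counter(items)
    if not counts:
        return None, 0.0
    item, n = counts.most_common(1)[0]
    return item, n / sum(counts.values())


# Toy generations for a single (model, language, topic, country) cell.
generations = ["Title A"] * 9 + ["Title B"]
title, share = top_share(generations)  # "Title A" accounts for 9 of 10 generations
```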

Open question

Does the Hugging Face dataset correspond to the exact version used for the paper, or to an
earlier snapshot? Several measurements (for example, the per-topic missing-QID rates above
and an N of 374 rather than 500 in the DeepSeek / United States / books cell) suggest the
released data may differ from the paper run, which would account for the gaps.

Notes

The README correction updates "13 LLMs" to "7 LLMs" to match the paper (Table 1). The same
count appears in the dataset card.
The smoke reproduction is a pipeline-integrity check rather than a faithful run: it uses a
Together Turbo serving endpoint and a proxy extraction model, so its comparison is
directional only.

ErenAta16 added 10 commits June 23, 2026 19:45

chore: move API keys to environment

8a358f3

fix: restore entity extraction client initialization

3d2114d

feat: add metric reproducibility tools

22b1d9e

docs: use neutral fidelity report wording

5443ca2

fix: restore runnable experiment imports

4121652

fix: restore shared Wikidata analysis helper

056ca96

feat: add published data quality audit

dae7bec

feat: add Together smoke reproduction flow

ce455ad

docs: document Together reproduction workflow

43d8a39

fix: scope missing QID audit to cultural entities

bcfd61b

ErenAta16 changed the title ~~Add metric reproducibility tools and fidelity checks~~ Reproducibility fixes, metric layer, and a data-quality audit Jun 24, 2026

ErenAta16 added 4 commits June 24, 2026 19:04

fix: label specificity as N/A when origin lookup is absent

1cdb294

docs: add result figures and audit summary

0daa501

docs: mark smoke specificity as N/A to match comparison logic

a4fab91

chore: add MIT license

36759c4

ErenAta16 force-pushed the chore/metrics-reproducibility branch from 93e63e1 to 36759c4 June 24, 2026 16:05

ErenAta16 wants to merge 14 commits into mainlp:main from ErenAta16:chore/metrics-reproducibility
