add(eco): new model and transformation rules by zubeydecivelek · Pull Request #521 · CERNDocumentServer/cds-migrator-kit

zubeydecivelek · 2026-05-05T13:13:41Z

DUMPS

# Posters
inveniomigrator dump records -q '980__a:POSTER -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-posters --chunk-size=1000

# Official Press Brochures
inveniomigrator dump records -q '980__a:BROCHURE 690C_a:CERNOFFICIALPRESSBROCHURE -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-official-press-brochures --chunk-size=1000

# ECO Notes
inveniomigrator dump records -q '980:NOTE 710__5:IR -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-notes --chunk-size=1000

Experiment brochures are excluded since they'll be migrated with the experiment collections

# Experiment Brochures
inveniomigrator dump records -q '980__a:BROCHURE 690C_a:CERNEXPERIMENTBROCHURE -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-experiment-brochures --chunk-size=1000

inveniomigrator dump records -q '980__a:CMSOUTREACH 6531_a:Brochure -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix cms-experiment-brochures --chunk-size=1000
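
Since every query above repeats the same exclusion clauses, the commands could be generated from one place. A minimal sketch, assuming a small standalone helper script (`dump_command`, `COMMON_EXCLUSIONS` and `SELECTORS` are hypothetical names, not part of the migrator kit); it only builds the command strings and does not run them:

```python
import shlex

# Exclusion clauses copied verbatim from the dump commands above.
COMMON_EXCLUSIONS = (
    "-595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY"
)

# file prefix -> collection-specific part of the query (taken from the commands above)
SELECTORS = {
    "eco-posters": "980__a:POSTER",
    "eco-official-press-brochures": "980__a:BROCHURE 690C_a:CERNOFFICIALPRESSBROCHURE",
    "eco-notes": "980:NOTE 710__5:IR",
}


def dump_command(prefix, selector, chunk_size=1000):
    """Return the inveniomigrator command line for one collection (hypothetical helper)."""
    query = f"{selector} {COMMON_EXCLUSIONS}"
    return (
        f"inveniomigrator dump records -q {shlex.quote(query)} "
        f"--file-prefix {prefix} --chunk-size={chunk_size}"
    )


if __name__ == "__main__":
    for prefix, selector in SELECTORS.items():
        print(dump_command(prefix, selector))
```

This keeps the exclusions identical across collections, so a change to them (for example adding another excluded tag) only needs to happen once.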

zubeydecivelek · 2026-05-05T13:15:48Z

    _note = force_list(value.get("a", ""))
    _note_z = force_list(value.get("z", ""))
-    notes_list = _note_z + _note
+    subject_notes = force_list(value.get("s", ""))


example records:

https://cds.cern.ch/record/1745542

https://cds.cern.ch/record/2059632

these look like subjects, not notes. I would rather place them in subjects

zubeydecivelek · 2026-05-05T13:18:27Z

+    # TODO: handle photo identifier
+    if scheme == "phopho":
+        id_value = StringValue(value.get("a", "")).parse()
+        new_id = {"scheme": "photo", "identifier": id_value}
+        raise IgnoreKey("eco_identifiers")


How should we handle photo identifiers?

example record: https://cds.cern.ch/record/43679

it should be linked in related identifiers with Photo resource type
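
A rough sketch of what that could look like, assuming the photo id sits in subfield `a` as in the diff above. The function name, the `"photo"` scheme, the `"references"` relation type and the `"image-photo"` resource type id are all assumptions for illustration and would need to be checked against the vocabularies actually used:

```python
def photo_related_identifier(value):
    """Build a related-identifier entry from a field carrying a photo id in subfield a.

    Hypothetical helper: the scheme, relation type and resource type ids
    below are assumptions, not confirmed vocabulary values.
    """
    identifier = str(value.get("a", "")).strip()
    if not identifier:
        # Nothing to link; the caller can ignore the field.
        return None
    return {
        "identifier": identifier,
        "scheme": "photo",  # assumed scheme name
        "relation_type": {"id": "references"},  # assumed relation
        "resource_type": {"id": "image-photo"},  # assumed id for the Photo type
    }
```

The rule would then append the returned entry to the related identifiers instead of raising `IgnoreKey` unconditionally.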

zubeydecivelek · 2026-05-05T13:19:01Z

+    # TODO: are they correct?
+    mapping = {
+        "poster": {"id": "poster"},
+        "brochure": {"id": "publication-brochure"},
+        "cmsoutreach": {"id": "publication-brochure"},
+        "note": {"id": "publication-technicalnote"},
+        "conferencepaper": {"id": "publication-conferencepaper"},
+        "lhcb_misc": {"id": "publication-other"},
+        "atlasslide": {"id": "publication-other"},
+    }


Are they correct?

CMS, ATLAS and LHCb content should not appear in this set, please filter it out and remove it from the code here

zubeydecivelek · 2026-05-05T13:21:02Z

+def related_ids(self, key, value):
+    """Translated related links."""
+    # TODO: how to transform? https://cds.cern.ch/record/1452204/export/xm
+    related_link = value.get("u", "")
+    if not related_link:
+        journal(self, key, value)
+        raise IgnoreKey("related_ids")


Some records have a url as identifier, some of them have meeting/conference info

this is an example with a conference and its url: https://cds.cern.ch/record/1452204/export/xm

it should go to related identifiers with scheme URL
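
A minimal sketch of that mapping, assuming the link is in subfield `u` as in the diff above. `url_related_identifier` is a hypothetical name and the `"references"` relation type is an assumption; returning `None` lets the caller fall back to the existing journal handling:

```python
from urllib.parse import urlparse


def url_related_identifier(value):
    """Return a related-identifier entry for the URL in subfield u, or None.

    Hypothetical helper; the relation type is an assumption.
    """
    url = str(value.get("u", "")).strip()
    parsed = urlparse(url)
    # Only accept absolute http(s) links; anything else is left to the caller.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        return None
    return {
        "identifier": url,
        "scheme": "url",
        "relation_type": {"id": "references"},  # assumed relation
    }
```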

kpsherva · 2026-05-08T09:09:41Z

        # n = script catalogued or via submission
-        if source not in ["n", "h", "m", "r"]:
-            raise UnexpectedValue(subfield="s", field=key, value=value)
+        if source not in ["n", "h", "m", "r", "d"]:


what does d mean?

311 records have d in the source field. I checked but couldn't find the meaning of d. Maybe digitized? I'll add a question to the curation sheet.
Some example recids: 43247, 43430, 824753, 1221556

kpsherva · 2026-05-08T09:09:55Z

        },
+        "paper": {
+            "relation_type": {"id": "references"},
+            # TODO: https://cds.cern.ch/record/2948638/export/xm


yes, i'll remove it. Just to give an example

kpsherva · 2026-05-08T09:10:47Z

+model.over("additional_titles", "(^246_[1_])", override=True)(
+    additional_titles_bulletin
+)
+
+model.over("additional_descriptions", "(^500__)")(additional_descriptions)
+model.over("additional_descriptions", "(^590__)")(translated_description)
+model.over("internal_notes", "^562__")(internal_notes)
+model.over("contributors", "^901__")(organisation)


why aren't we using the ones from base here? Why do they need to be imported here? It shouldn't be necessary

they're not in the base model. Also, some records are missing 245 but have 246, so we can import from bulletin to use 246 as the title if 245 is missing. Or if you prefer I can add the records with a missing title to the curation list.

kpsherva · 2026-05-08T09:12:33Z

+    return contributor[0]
+
+
+@model.over("eco_report_number", "(^037__)|(^088__)", override=True)


same question, why do you reimplement it ?

to handle emails in 088. Since 037 and 088 are implemented in the same rule in base, overriding only 088 doesn't work

kpsherva · 2026-05-08T09:13:21Z

+    scheme = original_scheme.lower()
+
+    # TODO: handle photo identifier
+    if scheme == "phopho":


hmm I am a bit worried about this... why do we get photo identifiers there? is this a relation?

some examples:

https://cds.cern.ch/record/1221555/

https://cds.cern.ch/record/1221560

kpsherva · 2026-05-08T09:14:16Z

+    raise IgnoreKey("eco_identifiers")
+
+
+@model.over("eco_urls", "^8564[1_]", override=True)


isn't this also reimplementation?

yes but some records have the url in subfield q not in u.

how about adding a parameter to the original function to indicate which subfield should be used?
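
One way the suggested parameter could look, as a sketch only: `extract_url` and its `subfields` argument are hypothetical and do not mirror the actual base function's signature. The idea is that base keeps the default (`u`) and ECO passes the extra subfield instead of reimplementing the rule:

```python
def extract_url(value, subfields=("u",)):
    """Return the first non-empty URL found in the given subfields, else None.

    Hypothetical helper: `subfields` lists the subfield codes to try, in order.
    """
    for code in subfields:
        url = str(value.get(code, "")).strip()
        if url:
            return url
    return None


# Base behaviour: only subfield u is considered.
base_url = extract_url({"u": "https://example.org/a.pdf"})

# ECO behaviour: fall back to subfield q when u is missing.
eco_url = extract_url({"q": "https://example.org/b.pdf"}, subfields=("u", "q"))
```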

kpsherva · 2026-05-08T09:16:49Z

+        "PUBLATLASSLIDE",
+        "POSTER",
+        "PREPRINT",
+        "CERNOFFICIALPRESSBROCHURE",


press we said to exclude as well

we're excluding this collection right?
https://cds.cern.ch/collection/Official%20Press%20Brochures?ln=en

yes anything that has tag CERNOFFICIALPRESSBROCHURE

kpsherva · 2026-05-08T09:17:20Z

+        "PRIVATLAS",
+        "PUBLATLASSLIDE",
+        "POSTER",
+        "PREPRINT",


do you have example preprints? it is quite unlikely we have research content in this data set

there's only one record with preprint:
https://cds.cern.ch/record/2675049/export/xm

the curators should check whether it is really a preprint. If not, the tag should be removed both from the record and from the code here

kpsherva · 2026-05-08T09:28:41Z

+    raise IgnoreKey("submitter_info")
+
+
+@model.over("languages", "^041__", override=True)


why do you override the base function for the language?

most of the records have 2 languages in the same 041__a field like "eng-fre" or "eng/fre". I think it makes sense to handle it like this, otherwise it'll be a lot to curate.

example: https://cds.cern.ch/record/921930/export/xm

ok, makes sense to leave it like this in ECO. Let's monitor in the future though if we should move it to base rules.

by the way, minor comment, after splitting the values by / you could pass each value from the split list to base_languages function (as value argument, with specific format) without reimplementing the pycountry lookup
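
A sketch of that idea, under the assumption that the base rule can be called with a single-code value. Here `base_languages` is passed in as a plain callable taking a `{"a": code}` dict and returning a list, which is a simplification: the real base function's signature and return format may differ.

```python
import re


def split_language_codes(raw):
    """Split values such as "eng-fre" or "eng/fre" into separate lowercase codes."""
    return [code.strip().lower() for code in re.split(r"[-/]", raw) if code.strip()]


def eco_languages(value, base_languages):
    """Delegate each split code to the base rule instead of repeating the lookup.

    `base_languages` is a stand-in for the base rule (assumed interface).
    """
    results = []
    for code in split_language_codes(value.get("a", "")):
        results.extend(base_languages({"a": code}))
    return results


# Usage with a stub in place of the real base rule:
def stub_base(value):
    return [{"id": value["a"]}]


langs = eco_languages({"a": "eng/fre"}, stub_base)
```

With this shape, the ECO rule only owns the splitting, and any correction of individual codes stays in one place in base.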

kpsherva · 2026-05-08T09:29:15Z

+        for lang in language_codes:
+            if not lang:
+                continue
+            if lang == "fre":


if the override is done only because of this, then please ask the library to correct this value instead of repeating the code, or fix it in base directly. Please avoid duplicating code

around 300 records have 2 languages in the same 041__a field. If you think this will be the case for any other collection I can fix it in base rules. I don't think it makes sense to ask the library since it's easy for us and they'd need to update lots of records. Do you prefer to fix it in base? Or I can simplify it in eco rules too to have less duplicated code

I prefer to fix it in the base since it is a recurring anomaly

kpsherva · 2026-05-08T09:30:05Z

+    value = dict(value)
+    affiliation = value.get("u", "").strip()
+    # Some records have "-" as affiliation: 1614471, 1953712
+    if affiliation and affiliation == "-":


we should remove these values from MARC instead of reimplementing the function

Isn't it faster/easier to ignore these values during migration instead of fixing the MARC?

the problem is that we can't handle all of the corner cases: each conditional statement we add makes the code less readable and a mistake more likely

kpsherva

there are some places for improvement. Overall, I think you should try to avoid reimplementing existing code whenever possible

zubeydecivelek linked an issue May 5, 2026 that may be closed by this pull request

Q2 simple collections to analyse #386

Open

1 task

zubeydecivelek requested a review from kpsherva May 5, 2026 13:21

zubeydecivelek added this to Sprint Q2 2026 ☀️ May 6, 2026

zubeydecivelek moved this to In review 🔍 in Sprint Q2 2026 ☀️ May 6, 2026

zubeydecivelek force-pushed the eco branch from c038a8f to fef71cb Compare May 7, 2026 12:23