Skip to content

add(eco): new model and transformation rules#521

Open
zubeydecivelek wants to merge 1 commit into
CERNDocumentServer:masterfrom
zubeydecivelek:eco
Open

add(eco): new model and transformation rules#521
zubeydecivelek wants to merge 1 commit into
CERNDocumentServer:masterfrom
zubeydecivelek:eco

Conversation

@zubeydecivelek
Copy link
Copy Markdown
Contributor

@zubeydecivelek zubeydecivelek commented May 5, 2026

DUMPS

# Posters
inveniomigrator dump records -q '980__a:POSTER -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-posters --chunk-size=1000
# Official Press Brochures
inveniomigrator dump records -q '980__a:BROCHURE 690C_a:CERNOFFICIALPRESSBROCHURE -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-official-press-brochures --chunk-size=1000
# ECO Notes
inveniomigrator dump records -q '980:NOTE 710__5:IR -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-notes --chunk-size=1000

Experiment brochures excluded since they'll be migrated with experiment collections

# Experiment Brochures
inveniomigrator dump records -q '980__a:BROCHURE 690C_a:CERNEXPERIMENTBROCHURE -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix eco-experiment-brochures --chunk-size=1000

inveniomigrator dump records -q '980__a:CMSOUTREACH 6531_a:Brochure -595__a:Press -980:DELETED -980:HIDDEN -980__c:MIGRATED -980__a:DUMMY' --file-prefix cms-experiment-brochures --chunk-size=1000

@zubeydecivelek zubeydecivelek linked an issue May 5, 2026 that may be closed by this pull request
1 task
_note = force_list(value.get("a", ""))
_note_z = force_list(value.get("z", ""))
notes_list = _note_z + _note
subject_notes = force_list(value.get("s", ""))
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these look like subjects not notes. I would rather place it in subjects

Comment on lines +101 to +105
# TODO: handle photo identifier
if scheme == "phopho":
id_value = StringValue(value.get("a", "")).parse()
new_id = {"scheme": "photo", "identifier": id_value}
raise IgnoreKey("eco_identifiers")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How should we handle photo identifiers?

example record: https://cds.cern.ch/record/43679

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be linked in related identifiers with Photo resource type

Comment on lines +137 to +146
# TODO: are they correct?
mapping = {
"poster": {"id": "poster"},
"brochure": {"id": "publication-brochure"},
"cmsoutreach": {"id": "publication-brochure"},
"note": {"id": "publication-technicalnote"},
"conferencepaper": {"id": "publication-conferencepaper"},
"lhcb_misc": {"id": "publication-other"},
"atlasslide": {"id": "publication-other"},
}
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are they correct?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMS, ATLAS and LHCb content should not happen in this set, please filter is out and remove from the code here

Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
Comment on lines +182 to +188
def related_ids(self, key, value):
"""Translated related links."""
# TODO: how to transform? https://cds.cern.ch/record/1452204/export/xm
related_link = value.get("u", "")
if not related_link:
journal(self, key, value)
raise IgnoreKey("related_ids")
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some records has url as identifier some of them has meeting/conference info

this is example with conference and it's url: https://cds.cern.ch/record/1452204/export/xm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be going to related identifiers with schema URL

# n = script catalogued or via submission
if source not in ["n", "h", "m", "r"]:
raise UnexpectedValue(subfield="s", field=key, value=value)
if source not in ["n", "h", "m", "r", "d"]:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does d mean?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

311 record has d in the source field. I checked but couldnt find the meaning of d. Maybe digitized? I'll add a question to curation sheet.
Some example recids: 43247, 43430, 824753, 1221556

},
"paper": {
"relation_type": {"id": "references"},
# TODO: https://cds.cern.ch/record/2948638/export/xm
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo left?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, i'll remove it. Just to give an example

Comment on lines +42 to +49
model.over("additional_titles", "(^246_[1_])", override=True)(
additional_titles_bulletin
)

model.over("additional_descriptions", "(^500__)")(additional_descriptions)
model.over("additional_descriptions", "(^590__)")(translated_description)
model.over("internal_notes", "^562__")(internal_notes)
model.over("contributors", "^901__")(organisation)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why aren't we using the noes from base here? Why do they need to be imported here? It shoulnd't be necessary

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

they're not in the base model. And some records missing 245 but they have 246, so we can import from bulletin to use 246 as title if 245 missing. Or if you prefer I can add missing title records to curation list.

Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
return contributor[0]


@model.over("eco_report_number", "(^037__)|(^088__)", override=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question, why do you reimplement it ?

Copy link
Copy Markdown
Contributor Author

@zubeydecivelek zubeydecivelek May 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to handle emails in 088, since 037 and 088 implemented in the same rule in base, only overriding 088 is not working

scheme = original_scheme.lower()

# TODO: handle photo identifier
if scheme == "phopho":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm I am a bit worried about this... why do we get photo identifiers there? is this a relation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

raise IgnoreKey("eco_identifiers")


@model.over("eco_urls", "^8564[1_]", override=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this also reimplementation?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes but some records have the url in subfield q not in u.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about adding a parameter to the original function to indicate which subfield should be used?

Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
"PUBLATLASSLIDE",
"POSTER",
"PREPRINT",
"CERNOFFICIALPRESSBROCHURE",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

press we said to exclude as well

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes anything that has tag CERNOFFICIALPRESSBROCHURE

"PRIVATLAS",
"PUBLATLASSLIDE",
"POSTER",
"PREPRINT",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you have example preprints? it is quite unlikely we have research content in this data set

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's only one record with preprint:
https://cds.cern.ch/record/2675049/export/xm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it should be checked by the curators if it is really preprint. If not, the tag should be removed both t=from the record and from the code here

Comment thread cds_migrator_kit/rdm/records/transform/xml_processing/rules/eco.py Outdated
raise IgnoreKey("submitter_info")


@model.over("languages", "^041__", override=True)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you override the base function for the language?

Copy link
Copy Markdown
Contributor Author

@zubeydecivelek zubeydecivelek May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

most of the records have 2 languages in the same 041__a field like "eng-fre" or "eng/fre". I think it's make sense to handle like this, otherwise it'll be a lot to curate.

example: https://cds.cern.ch/record/921930/export/xm

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, makes sense to leave it like this in ECO. Let's monitor in the future though if we should move it to base rules.

by the way, minor comment, after splitting the values by / you could pass each value from the split list to base_languages function (as value argument, with specific format) without reimplementing the pycountry lookup

for lang in language_codes:
if not lang:
continue
if lang == "fre":
Copy link
Copy Markdown
Contributor

@kpsherva kpsherva May 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the override is done it is only because of this, then please ask the library to correct this value instead of repeating the code or fix it in base directly. Please avoid duplication of the code

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

around 300 record has 2 languages in the same 041__a field. If you think this will be the case for any other collection I can fix it in base rules. I dont think it's makes sense to ask library since it's easy for us and they'll need to update lots of records. Do you prefer to fix it in base? Or I can simplify it in eco rules too to have less duplicated code

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer to fix it in the base since it is a recurring anomaly

value = dict(value)
affiliation = value.get("u", "").strip()
# Some records have "-" as affiliation: 1614471, 1953712
if affiliation and affiliation == "-":
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shold remove these values from MARC instead of reimplementing the function

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isnt it faster/easier to ignore these values during migration instead of fixing the MARC?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the problem is that we can't handle all of the corner cases because it makes the code less readable and possibility of a mistake higher, each time we are adding a conditional statement

Copy link
Copy Markdown
Contributor

@kpsherva kpsherva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are some places for improvement. Overall, I think you should try avoid reimplementing existing code whenever possible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Q2 simple collections to analyse

2 participants