add(eco): new model and transformation rules#521
Conversation
| _note = force_list(value.get("a", "")) | ||
| _note_z = force_list(value.get("z", "")) | ||
| notes_list = _note_z + _note | ||
| subject_notes = force_list(value.get("s", "")) |
There was a problem hiding this comment.
example records:
There was a problem hiding this comment.
these look like subjects not notes. I would rather place it in subjects
| # TODO: handle photo identifier | ||
| if scheme == "phopho": | ||
| id_value = StringValue(value.get("a", "")).parse() | ||
| new_id = {"scheme": "photo", "identifier": id_value} | ||
| raise IgnoreKey("eco_identifiers") |
There was a problem hiding this comment.
How should we handle photo identifiers?
example record: https://cds.cern.ch/record/43679
There was a problem hiding this comment.
it should be linked in related identifiers with Photo resource type
| # TODO: are they correct? | ||
| mapping = { | ||
| "poster": {"id": "poster"}, | ||
| "brochure": {"id": "publication-brochure"}, | ||
| "cmsoutreach": {"id": "publication-brochure"}, | ||
| "note": {"id": "publication-technicalnote"}, | ||
| "conferencepaper": {"id": "publication-conferencepaper"}, | ||
| "lhcb_misc": {"id": "publication-other"}, | ||
| "atlasslide": {"id": "publication-other"}, | ||
| } |
There was a problem hiding this comment.
Are they correct?
There was a problem hiding this comment.
CMS, ATLAS and LHCb content should not happen in this set, please filter is out and remove from the code here
| def related_ids(self, key, value): | ||
| """Translated related links.""" | ||
| # TODO: how to transform? https://cds.cern.ch/record/1452204/export/xm | ||
| related_link = value.get("u", "") | ||
| if not related_link: | ||
| journal(self, key, value) | ||
| raise IgnoreKey("related_ids") |
There was a problem hiding this comment.
Some records has url as identifier some of them has meeting/conference info
this is example with conference and it's url: https://cds.cern.ch/record/1452204/export/xm
There was a problem hiding this comment.
it should be going to related identifiers with schema URL
| # n = script catalogued or via submission | ||
| if source not in ["n", "h", "m", "r"]: | ||
| raise UnexpectedValue(subfield="s", field=key, value=value) | ||
| if source not in ["n", "h", "m", "r", "d"]: |
There was a problem hiding this comment.
311 record has d in the source field. I checked but couldnt find the meaning of d. Maybe digitized? I'll add a question to curation sheet.
Some example recids: 43247, 43430, 824753, 1221556
| }, | ||
| "paper": { | ||
| "relation_type": {"id": "references"}, | ||
| # TODO: https://cds.cern.ch/record/2948638/export/xm |
There was a problem hiding this comment.
yes, i'll remove it. Just to give an example
| model.over("additional_titles", "(^246_[1_])", override=True)( | ||
| additional_titles_bulletin | ||
| ) | ||
|
|
||
| model.over("additional_descriptions", "(^500__)")(additional_descriptions) | ||
| model.over("additional_descriptions", "(^590__)")(translated_description) | ||
| model.over("internal_notes", "^562__")(internal_notes) | ||
| model.over("contributors", "^901__")(organisation) |
There was a problem hiding this comment.
why aren't we using the noes from base here? Why do they need to be imported here? It shoulnd't be necessary
There was a problem hiding this comment.
they're not in the base model. And some records missing 245 but they have 246, so we can import from bulletin to use 246 as title if 245 missing. Or if you prefer I can add missing title records to curation list.
| return contributor[0] | ||
|
|
||
|
|
||
| @model.over("eco_report_number", "(^037__)|(^088__)", override=True) |
There was a problem hiding this comment.
same question, why do you reimplement it ?
There was a problem hiding this comment.
to handle emails in 088, since 037 and 088 implemented in the same rule in base, only overriding 088 is not working
| scheme = original_scheme.lower() | ||
|
|
||
| # TODO: handle photo identifier | ||
| if scheme == "phopho": |
There was a problem hiding this comment.
hmm I am a bit worried about this... why do we get photo identifiers there? is this a relation?
There was a problem hiding this comment.
| raise IgnoreKey("eco_identifiers") | ||
|
|
||
|
|
||
| @model.over("eco_urls", "^8564[1_]", override=True) |
There was a problem hiding this comment.
isn't this also reimplementation?
There was a problem hiding this comment.
yes but some records have the url in subfield q not in u.
There was a problem hiding this comment.
how about adding a parameter to the original function to indicate which subfield should be used?
| "PUBLATLASSLIDE", | ||
| "POSTER", | ||
| "PREPRINT", | ||
| "CERNOFFICIALPRESSBROCHURE", |
There was a problem hiding this comment.
press we said to exclude as well
There was a problem hiding this comment.
we're excluding this collection right?
https://cds.cern.ch/collection/Official%20Press%20Brochures?ln=en
There was a problem hiding this comment.
yes anything that has tag CERNOFFICIALPRESSBROCHURE
| "PRIVATLAS", | ||
| "PUBLATLASSLIDE", | ||
| "POSTER", | ||
| "PREPRINT", |
There was a problem hiding this comment.
do you have example preprints? it is quite unlikely we have research content in this data set
There was a problem hiding this comment.
there's only one record with preprint:
https://cds.cern.ch/record/2675049/export/xm
There was a problem hiding this comment.
it should be checked by the curators if it is really preprint. If not, the tag should be removed both t=from the record and from the code here
| raise IgnoreKey("submitter_info") | ||
|
|
||
|
|
||
| @model.over("languages", "^041__", override=True) |
There was a problem hiding this comment.
why do you override the base function for the language?
There was a problem hiding this comment.
most of the records have 2 languages in the same 041__a field like "eng-fre" or "eng/fre". I think it's make sense to handle like this, otherwise it'll be a lot to curate.
There was a problem hiding this comment.
ok, makes sense to leave it like this in ECO. Let's monitor in the future though if we should move it to base rules.
by the way, minor comment, after splitting the values by / you could pass each value from the split list to base_languages function (as value argument, with specific format) without reimplementing the pycountry lookup
| for lang in language_codes: | ||
| if not lang: | ||
| continue | ||
| if lang == "fre": |
There was a problem hiding this comment.
if the override is done it is only because of this, then please ask the library to correct this value instead of repeating the code or fix it in base directly. Please avoid duplication of the code
There was a problem hiding this comment.
around 300 record has 2 languages in the same 041__a field. If you think this will be the case for any other collection I can fix it in base rules. I dont think it's makes sense to ask library since it's easy for us and they'll need to update lots of records. Do you prefer to fix it in base? Or I can simplify it in eco rules too to have less duplicated code
There was a problem hiding this comment.
I prefer to fix it in the base since it is a recurring anomaly
| value = dict(value) | ||
| affiliation = value.get("u", "").strip() | ||
| # Some records have "-" as affiliation: 1614471, 1953712 | ||
| if affiliation and affiliation == "-": |
There was a problem hiding this comment.
we shold remove these values from MARC instead of reimplementing the function
There was a problem hiding this comment.
Isnt it faster/easier to ignore these values during migration instead of fixing the MARC?
There was a problem hiding this comment.
the problem is that we can't handle all of the corner cases because it makes the code less readable and possibility of a mistake higher, each time we are adding a conditional statement
kpsherva
left a comment
There was a problem hiding this comment.
there are some places for improvement. Overall, I think you should try avoid reimplementing existing code whenever possible
DUMPS
Experiment brochures excluded since they'll be migrated with experiment collections