fix: preserve opaque <stem> contents in Sanitizer (MathML round-trip) #117
Open
opoudjis wants to merge 1 commit into
Closes #116

The recursive sanitiser stripped MathML / AsciiMath / LaTeX from inside <stem> elements because it descended into every child element before checking the ALLOWED allow-list. <stem> itself is ALLOWED, but its children (<math>, <mstyle>, <msub>, <mi>, <mn>, <asciimath>, …) are not, so they were unwrapped to bare text.

End-to-end symptom: bibliographic titles with embedded math lost their notation content on YAML round-trip, with only the text nodes surviving alongside their whitespace skeleton.

Fix: introduce an OPAQUE set ({<stem>} for now) listing elements whose contents are out-of-band inline notation and must be preserved verbatim. sanitize_children skips recursion into OPAQUE elements after the rename pass, so their inner XML survives untouched.

The downstream regression that prompted this is in metanorma/metanorma-generic: the spec "Metanorma::Generic::BibdataConfig preserves embedded MathML when deserialising a bibdata title" fails until this lands and a new relaton-bib release ships.

Three new sanitizer specs cover:

- MathML inner elements survive inside <stem> (do not unwrap to text)
- <stem> attributes plus deeply nested MathML survive together
- siblings of <stem> are still sanitised; <stem>'s opacity is local

Assertions use include-style matches because Nokogiri's serialiser may reflow whitespace around nested elements; the semantic claim is "inner elements survive, not just their text content".
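The walk described above can be sketched roughly as follows. This is a minimal illustration, not the code in lib/relaton/bib/sanitizer.rb: it uses a hypothetical Node struct instead of Nokogiri so that it runs standalone, and the ALLOWED and RENAME contents, the to_xml helper, and the type attribute value are assumptions made up for the example. Only the OPAQUE list and the ordering (rename, then opaque check, then recursion and unwrapping) come from the description.

```ruby
# Hypothetical stand-in for an XML element: a tag name, attributes, and
# children that are either Node instances or Strings (text nodes).
Node = Struct.new(:name, :attrs, :children) do
  def to_xml
    attr_str = attrs.map { |k, v| %( #{k}="#{v}") }.join
    inner = children.map { |c| c.is_a?(Node) ? c.to_xml : c }.join
    "<#{name}#{attr_str}>#{inner}</#{name}>"
  end
end

# Assumed lists for illustration only; the real ALLOWED and RENAME differ.
ALLOWED = %w[title em strong stem].freeze
RENAME  = { "b" => "strong", "i" => "em" }.freeze
# From the description: elements whose contents are kept verbatim.
OPAQUE  = %w[stem].freeze

def sanitize_children(node)
  node.children = node.children.flat_map do |child|
    next [child] unless child.is_a?(Node)             # text passes through

    child.name = RENAME.fetch(child.name, child.name) # rename pass
    next [child] if OPAQUE.include?(child.name)       # skip recursion entirely

    sanitize_children(child)                          # recursive descent
    # Elements outside ALLOWED are unwrapped: replaced by their children.
    ALLOWED.include?(child.name) ? [child] : child.children
  end
  node
end

msub  = Node.new("msub", {}, [Node.new("mi", {}, ["x"]), Node.new("mn", {}, ["2"])])
title = Node.new("title", {}, [
  "Bounds on ",
  Node.new("stem", { "type" => "MathML" }, [Node.new("math", {}, [msub])]),
  " in ",
  Node.new("span", {}, [Node.new("b", {}, ["practice"])])
])

puts sanitize_children(title).to_xml
# prints: <title>Bounds on <stem type="MathML"><math><msub><mi>x</mi><mn>2</mn></msub></math></stem> in <strong>practice</strong></title>
```

In this sketch the sibling span is still unwrapped and its b child renamed, while everything under stem is left alone, which mirrors the "opacity is local" claim. Without the OPAQUE check, math, msub, mi and mn would each fail the ALLOWED test and collapse to the text "x2".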
Closes #116
See the linked issue for the full diagnosis, reproducer, and root-cause walk-through. This PR is the proposed fix.
Change

- lib/relaton/bib/sanitizer.rb: adds an OPAQUE = %w[stem].freeze set, listing elements whose children are non-basicdoc inline notation (MathML, AsciiMath, LaTeX, …) and must be preserved verbatim rather than sanitised against ALLOWED. sanitize_children short-circuits before recursing into OPAQUE elements (after the RENAME pass, before the recursive descent), so their inner XML survives the walk untouched.
- spec/relaton/bib/sanitizer_spec.rb: three new <stem> cases: MathML inner elements survive, <stem> attributes plus deeply nested MathML survive, and siblings of <stem> are still sanitised (opacity is local). Assertions use include-style matches because Nokogiri's serialiser may reflow whitespace around nested elements; the semantic claim is "inner elements survive, not just their text content".

Verification

bundle exec rspec spec/relaton/bib/sanitizer_spec.rb → 36 examples, 0 failures.

Target branch

lutaml-integration: the 2.x line lives on this branch (main is still 1.x, last tagged v1.20.8).

Downstream

The regression that prompted this fix is in metanorma/metanorma-generic: the spec "Metanorma::Generic::BibdataConfig preserves embedded MathML when deserialising a bibdata title" fails until this lands and a new relaton-bib release ships. No Gemfile.devel pointer is being added to metanorma-generic because there is no metanorma-generic release pending this cycle; the metanorma-generic spec stays red until the next release-cycle gem bump.
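As a side note on the include-style assertions mentioned for the new specs, the following plain-Ruby sketch shows why exact string equality would be brittle. The two serialisations and the inner_elements_survive? helper are hypothetical and written for this illustration; the actual specs use RSpec matchers against the sanitiser's output.

```ruby
# Two hypothetical serialisations of the same <stem> element: one compact,
# one with indentation of the kind a serialiser might add.
compact  = "<stem><math><mi>x</mi></math></stem>"
reflowed = "<stem>\n  <math>\n    <mi>x</mi>\n  </math>\n</stem>"

# Exact equality fails on whitespace alone.
puts compact == reflowed                         # false

# Include-style check: assert that the inner elements are present,
# without pinning down the whitespace between them.
def inner_elements_survive?(xml)
  %w[<math> <mi>x</mi> </math>].all? { |fragment| xml.include?(fragment) }
end

puts inner_elements_survive?(compact)            # true
puts inner_elements_survive?(reflowed)           # true
puts inner_elements_survive?("<stem>x</stem>")   # false: only the text survived
```

The last case is the regression from #116: the text "x" is still there, but the MathML wrapper is gone, and a check on text content alone would not catch that.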